is a female. I'm the only female here. I'm hoping that improves; it's always a challenge to get women into STEM. But thank you for having me. Today I'm going to be talking about how to build a production big data pipeline. Before I start, I just want a show of hands so I can adjust the speed at which I'm going: is everyone here familiar with, or has everyone dabbled in, big data? Can I get a show of hands of who has dabbled with big data? Okay, perfect. There are like three people in the room, so my talk is apt, because I lay out the context of what big data is. If you were all already familiar, I would skip that part, but it's good context.

So before anything else, a little bit about myself, to introduce myself to you guys and break the ice. First things first, I love data, of course, so I'm in the right field and in the right team. I have 15-plus years of software development experience, mostly focused on data, with the past five years focused on big data. For the past five years, I've been the engineering manager for what we call the Autodesk Data Platform, or ADP for short, which is the de facto data platform used by the entire company, all of Autodesk. It was built right here in Singapore. It's now maintained globally, but we started building the infrastructure and the whole concept here in Singapore. Also, back in 2017 and 2018, I led a team of UC Davis Master of Business Analytics students on a project we called System Incident Prediction. At the time there was no term for it yet, but 2018 was the period when everyone started wanting to predict system incidents, and now we have a term for it: AIOps, or AI operations. What I'm passionate about, apart from data, is strategy, technology, and innovation. Also teamwork and team culture; being an engineering manager, I want there to be a good culture within the team. And I love sports, tennis primarily, and skateboarding, both of which are on hold because I just busted my knee. So there you go, a little bit about myself.

All right, so what I'm going to be talking about today: I want to introduce what big data is and what the big deal about it is. I'll also talk a little bit about the architecture of the Autodesk Data Platform. Granted, it's a bit high level, but I think that's appropriate for this session. And I'm going to walk through how we build a big data pipeline: what a big data pipeline is to begin with, and why we want to build one. I've reserved some time for Q&A towards the end, so if you have questions, I welcome them, please do ask; I'd just ask you to hold them till the end so that we have enough time to go through the entire presentation. Is that okay with everyone? Cool.

All right, first and foremost: big data. What is so special about it? So if we think about data, can someone, especially those of you who haven't touched big data, which is most of the room, tell me: when I say the word data, what do you normally think about? What technology do we use for data? Databases, great. That's right. A lot of numbers. Okay, that's great, because those are actually the two illustrations I have. People usually think about numbers, and I kid you not, in this day and age people do still use Excel sheets as their database.
Honestly, I don't know why. But yes, Excel sheets and databases, right? So when we talk about big data, what is the difference between big data and data? What makes it so big? Let's think about this for a while. In 2019, this is what happens in an internet minute, in 60 seconds: 18.1 million texts sent (my eyesight is not so good and I don't want to block the camera), 4.5 million YouTube videos viewed, 87,500 tweets, 347,000 on Instagram. I'm not going to go through all of it, but you can see that in one minute, a lot of things are happening, and this is largely due to the advent of the internet. So can we all agree, looking at the numbers here, that this data is not small? This data is big; it's massive.

Let me share some facts about data with you. Around 90% of all the data in the world right now was collected over the past two years. It boggles the mind: we've had what, more than 5,000 years of history, but most of the data we have was collected in the last two years. That's just how it blew up. And by 2025, it is estimated that we will be generating around 463 exabytes of data a day. Just to give context on what an exabyte is: it's 1,000 to the power of six bytes, that's 10 to the 18th, a lot of zeros. So when I say it's huge, it's massive.

So if we look at that and agree that this is big, what are the aspects of it that make it big data? First and foremost, the volume. We can see 1 million here, 2.1 million there; it's big, and it all happens in 60 seconds. So the volume is huge. The velocity at which this data comes through is also very fast: for us to get 2.1 million snaps in 60 seconds, the data has to flow at a very fast rate, and it's not a rate that is easily handled, certainly not by an Excel sheet. And the third aspect: there are a lot of different variations of data. It's not just text; there are videos, there is music, there are pictures, a lot of different kinds of data. So going back to traditional thinking: do you think these are going to be easily handled by a database? Honestly, I think not.

So basically, when we talk about big data, it usually comes with three Vs, and these are the three Vs: volume, velocity, and variety. As more and more data is collected, and as more and more people and companies process data, we come up with new Vs every day. I'm not going to talk about all the Vs, but these are the three primary Vs you have to be aware of for big data. And if we think about it, how do we collect and store this kind of data? Again, a database? Maybe you can store it in a database, but it's not going to be usable. And how do we process this data? That brings us to two more Vs. Veracity: how do we know how true the data is, with all the fake news and fake information out there? How do we know if your data has high integrity and high accuracy? And value: how do we unlock the value of this data? To unlock the value, we need to be able to process the data and do some analytics on it, to slice and dice, and finally do predictive analytics on it.
To do that, the traditional technology we have used in the past is not sufficient. It may be able to store the data; you can store it in a database as a blob. When I say database, I'm talking about relational databases, and relational databases are not really good for these kinds of things. Think about it: how do you store music? How do you store pictures? And storing is one thing; how do you enable processing on videos? On pictures? Traditional relational databases are not good enough.

So let me share a little bit about how we handle this kind of big data at Autodesk. A little context: big data at Autodesk is primarily product usage data. If you're not familiar with Autodesk's products, they include AutoCAD, Maya, Inventor, and Revit; these are our hero products. We have over 100 products, and when all those products have usage data, telemetry data, and we collect it all, it becomes huge. The good thing about the data we collect is that it's semi-structured: there's a common schema we slap on top of it so that it's easier to process. It's not a blob here, text there, video here. It's text, it's semi-structured, and it has a common schema.

It is also high-volume, high-velocity data. Just to give an example, the average volume of data we receive is 110 gigabytes per hour, uncompressed, and 73 million records per hour, and this is growing. That's the average; at peak, we were receiving 157 gigabytes of data per hour and 110 million records per hour. It sounds big, but big is relative, so let's keep it in perspective and do a comparison. The average number of tweets per hour is only around 21 million, which, if you do some rough byte math, translates to around six gigabytes (21 million tweets at roughly 280 bytes of text each is about 5.9 gigabytes). Six gigabytes is a conservative figure, because on Twitter, although you have the character limit, you can also post pictures and GIFs. But even if you blow that up by ten times, to 60 gigabytes, it is still only about half of what we are processing at Autodesk. So that's a perspective on the volume of data we're processing.

To handle this kind of high-velocity, high-volume, semi-structured data, I'll talk a little about the architecture we've employed at Autodesk. There's no single silver bullet that handles all of this. You have to have data collection. You have to have a storage strategy. You also have to have different tools to access and process the data. So I've separated it into big blocks, and I'll talk about the blocks in a little while.

Like I said, basically we have data collection. We currently have two patterns for data collection, and we're adding more. What we want are ingestion patterns, not one-off use cases where, oh, I have this use case right now, so I'll create one ingestion mechanism for it that isn't reusable. Patterns mean plug and play: today it's used by this team, and tomorrow, because it's a pattern, another team in the company can plug it in. Right now we have two of those. We also have data exploration: once we've collected the data and cleaned it, we want to unlock that data for the internal users.
We want them to be able to do their own crunching and analytics on the data we collect. And ETL, for those of you not that familiar with data, it's okay: ETL stands for extract, transform, load. It's the industry term for batch processing, basically. For the data we collect, we want to do some aggregations to come up with the different numbers that executives are keen on, and ETL is what we call that. Finally, ETL and data exploration feed into the data visualizations. The dashboards that management sees actually have an engine behind them. It's not raw data straight into a dashboard; that is in the realm of science fiction. There's a lot going on before you can populate a dashboard, and I will talk a little bit about that later on.

But first, let me talk about data collection pattern one. For us, like I said, we have a common schema, and for each and every product we have an SDK. Depending on whatever programming language they use, we have a multitude of SDKs in different languages that teams can integrate into their products to collect data following the common schema we have. There are also some services, meaning internal services that we use to authenticate users, to authenticate licenses, and all these things; they are also instrumented with our SDKs and send data to our central data lake. This is an abstraction of it; there's a lot more happening in the middle, but ultimately the data lands in an AWS service called Kinesis Firehose. Kinesis Firehose is like a message bus, a message queue: it's able to handle high-capacity, high-volume data with a lot of reliability.

Our architecture was not always this simple. When we built the Autodesk Data Platform five years ago, when I started out with the team, Amazon did not have big data offerings, so we had to do it on our own. We used Kafka, and we used Mesos to orchestrate all our clusters. But recently we moved to what we call "hug the bear": we want to make the most of AWS services so that we don't have to develop this on our own. So now it sends to Kinesis Firehose. That's pattern number one.

And on top, you have the users. These are different kinds of users. I know this is kind of a big dump on you guys, and there is a steep learning curve for data; I'm just trying to give you a high-level understanding so that you can get into why we want to build a big data pipeline. For big data, there are mostly three kinds of users. One is the data analyst: what he or she does is look at the numbers as they are right now and do some analytics on them, to get insights from the data we have today. Another role in data is the data scientist. Data science is more about predictive analytics: based on the data we collect right now, what do I project the pattern to be? What do I project the trend to be? And if the trend is going this way, we should follow it and take these certain actions. It's more prescriptive. And last but not least is the data engineer. What does a data engineer do? The engineer does all of this work behind the scenes, setting up infrastructure so that the data scientists, the data analysts, and other people in the company can make use of the data.
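To make collection pattern one a bit more concrete, here is a minimal sketch of what the sending side might look like: one usage event pushed to Kinesis Firehose using the AWS SDK for Java, from Scala. The delivery stream name, payload, and field names are all made up for illustration; the real SDKs and the common schema are more involved than this.

```scala
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder
import com.amazonaws.services.kinesisfirehose.model.{PutRecordRequest, Record}
import java.nio.ByteBuffer

object UsageEventSender {
  def main(args: Array[String]): Unit = {
    // One semi-structured usage event following a common schema (illustrative only)
    val payload = """{"product":"AutoCAD","event":"command.run","command":"LINE","country":"SG"}"""

    val firehose = AmazonKinesisFirehoseClientBuilder.defaultClient()
    val request = new PutRecordRequest()
      .withDeliveryStreamName("adp-usage-events") // hypothetical delivery stream name
      .withRecord(new Record().withData(ByteBuffer.wrap(payload.getBytes("UTF-8"))))

    // Firehose buffers the record and delivers it downstream (e.g., to S3)
    firehose.putRecord(request)
  }
}
```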
So things like this, a data scientist will not know how to integrate together. Data analysts? Forget it, they will not know. So those are the users. I went on a tangent, but those are the users.

And for the SDKs that we have, we have what we call remote-controlled data collection. You may not be familiar with it, but there's a recent law that came out of the EU called GDPR. It's all about data privacy and data protection. Because of that, we cannot always be collecting data. Collection cannot be always-on; there are only certain spurts when we can collect data, and it should be for a purpose. This is what we call ethical use of data. To facilitate that, we have a service called the Experiment Portal, which serves as the remote control to toggle data collection on or off. So once a data analyst says, hey, there's a question I want to answer, say, how many users of the AutoCAD 2020 release are using this feature, they set it up in the Experiment Portal and push it out to all the AutoCAD installs released out there. Those installs use our SDK; they're instrumented, and they will start collecting data. These experiments have parameters: if I just want to know about the U.S., they limit the scope to the U.S. So that's what the Experiment Portal is. It is backed by Amazon Aurora, which is a managed relational database, and our storage is Amazon S3, object storage. So that's ingestion pattern number one.

The second data collection, or ingestion, pattern is simple: for any unstructured or semi-structured data, we just need our users to dump it into S3, and we have an automated system backed by AWS Lambda to ingest it into our raw data (I'll show a small sketch of the idea in a moment). The raw data is the red bucket, and there's a different bucket for processed data, again because of GDPR, data privacy, and security. And not just that: for processing needs, in order to process fast and return results fast, we need to be able to segregate the data. These S3 buckets here and here, and another bucket you'll see later, are what we call the data lake. Data lake is a standard industry term for big data as well: it's deep, it's murky, there's a lot of confusing data in there, and you need to make sense of it.

Next, I will talk about ETL. After the data has been collected, we need to do some processing on it. For analysts and data scientists to be able to make sense of the data, first and foremost we have to clean it, because even though we collect through SDKs, things can happen that make the data unclean. For example, some packets are lost and the data is corrupted, or an engineer did not instrument things properly: we're expecting an integer data type and they send us a Boolean. These kinds of things have to be weeded out, cleansed, for data scientists to be able to use the data we collect. The underlying big data technology we're using is Spark, and I will talk a little bit about why we need a big data technology to begin with. The underlying scheduler we're using is Oozie, and again, I will talk a little bit later about why we need a scheduler: why can't I just use a cron job? Keep that in mind; I'll get to it in the next few slides. But our big data technology is Spark, and our scheduler is Oozie.
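Going back to that second ingestion pattern for a second, here's the small sketch I promised: an S3-triggered Lambda handler that copies newly dropped objects into a restricted raw-data bucket. This is only the shape of the idea; the bucket names are hypothetical, and the real pipeline does much more (validation, partitioning, GDPR handling).

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

// Fired by S3 "object created" notifications on the drop bucket.
class RawIngestHandler extends RequestHandler[S3Event, Unit] {
  private val s3 = AmazonS3ClientBuilder.defaultClient()

  override def handleRequest(event: S3Event, context: Context): Unit =
    event.getRecords.asScala.foreach { rec =>
      val srcBucket = rec.getS3.getBucket.getName
      val key       = rec.getS3.getObject.getKey
      // Copy into the restricted ("red") raw-data bucket; the name is made up.
      s3.copyObject(srcBucket, key, "adp-raw-data", key)
      context.getLogger.log(s"Ingested s3://$srcBucket/$key into raw data")
    }
}
```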
And what we basically do is this: the data analysts and the other data engineers in the company use these technologies to build aggregations. Again, what are aggregations? When the data is in the data lake, it is deep and murky and huge, and to answer one particular question, you don't need all of the data. You need an aggregation, a group-by of that data: give me this number, grouped by this field. Say I want to know how many licenses are being used. I don't need the product information; I just want how many licenses are being used. If I want to know how many licenses are used per product, then yes, I need the product information, but I don't need the country information. If I want to know how many licenses are used per product, per country, then yes, I need the country information too. So these are different kinds of aggregations answering very specific questions, and those are what we call data cubes. Once an aggregation is built, it becomes part of the data pipeline and produces a cube that is usable by data analysts and data scientists. And then we put it into the green S3 bucket over there. It is green because it's open; the other one is red because it's raw data and access is primarily restricted.

And then, after we do some cleansing and some aggregation on the data, we open it up for data exploration. What is data exploration? Say some analyst in the company, or actually some product manager. Let's say I'm the product manager for AutoCAD and I just released a new feature, feature A. I want to know: how well is my feature A doing? How many crashes have been caused by feature A in the product? All the data is there in the green bucket, but how do I access it? Through these technologies. Hive is our metastore, so basically metadata about the data, and then we use the Presto engine. Presto was developed by Facebook; it's a distributed SQL query engine. And for big data, columnar file formats and columnar databases are usually a lot faster, because they get rid of all the things you don't need. If I just want the country, I can read just the country column; I don't read all the rows of all the data I have and then filter by country. It doesn't work that way. Columnar is the way to go.

Also, some of our users are data scientists who want to do predictive analytics. Data scientists mostly use R or Python, so we need to provide an interface for them to access our data and use R or Python on top of it. That's why we provide that: it's a JDBC interface, basically. They can connect their Jupyter notebooks to it, they can connect SQL Workbench for SQL-like querying, and other things, but these are the supported ones.

And finally, after all the crunching, research, and exploration, we get to the visualization. For visualization, the de facto BI tool we support is Looker, and it is backed by Amazon Athena for faster querying. Looker is what we support, but we can also have Tableau. We can... oh, finally, another female. Welcome! Yeah. So, as I was saying, Looker is the de facto BI tool we support, but because of the JDBC interface, we're able to support Tableau and QlikView as well.
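To make the data cube idea concrete, here's a minimal sketch of a licenses-per-product-per-country aggregation as a Spark job in Scala. The S3 paths and column names are invented for illustration; the real cubes and schema differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

object LicenseCube {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("license-cube").getOrCreate()

    // Cleansed usage data from the processed bucket (hypothetical path and columns)
    val usage = spark.read.parquet("s3://adp-processed-data/usage/")

    // The aggregation: distinct licenses, grouped by product and country
    val cube = usage
      .groupBy("product", "country")
      .agg(countDistinct("license_id").as("active_licenses"))

    // Publish the cube to the open ("green") bucket for analysts to query
    cube.write.mode("overwrite")
      .parquet("s3://adp-open-data/cubes/licenses_by_product_country/")

    spark.stop()
  }
}
```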
So that is the overall big data infrastructure, the architecture we have to handle the product usage data. Now, remember I asked: why do we need a big data technology to process big data? Isn't it as simple as a database and some querying? And why do we need a scheduler, why not just cron?

The big data technology usually operates on the concept of MapReduce. What is MapReduce? MapReduce operates on the notion that instead of bringing data to the processing, which can be costly, we bring processing to the data. Let me give you a very basic example. Say you're doing a word count, and your input file is 10 gigabytes. If you write an application using a shell script, say grep piped into wc, you can get a word count, but that downloads the 10 gigabytes of data into your local machine, or whichever environment you're running your application on. That download alone takes a long time; it's 10 gigabytes, and in our case we're talking about 110 gigabytes. So instead of bringing the data to the processing, we let the data stay where it is and bring the processing code to the data, because the code is small anyway; the source code is megabytes at best. This is what MapReduce is all about.

So here's what MapReduce typically does. Take three lines: XBB, CBA, XAC. It's actually a character count: I want to count how many of each character there are. Oh, by the way, sorry, I forgot to mention: one other great thing about MapReduce is that it can parallelize. Of course, you can also write a script with threading and all these things, but it becomes very difficult to maintain and very difficult to track. MapReduce parallelizes innately. So you have three lines here; there will be three nodes, what we call partitions, and they allocate the first line to the first node, the second line to the second node, and the third line to the third node, employing the same script, the same counting logic, on all three nodes.

On the first node, it splits: I have one instance of X, one instance of B, and another instance of B, so X:1, B:1, B:1. The same logic applies on the second node: one C, one B, one A, and so on. And on the third node: one X, one A, one C. Then a shuffle happens in between, to make sure each partition has only one character allocated to it. In this instance it's one character, but in a different example I'll give later, you'll understand a little better what I mean. So for the partition of A, I have two instances of A after the shuffle; same with B, three instances; C, two instances; and X, two instances. And after they combine and sort, in the last step we call reduce, the reduce actually gives us the counts: altogether I have two A's, three B's, two C's, and two X's. This is what happens in MapReduce.

Now, that's a very basic example; how do we actually use it? Again, we are collecting product usage data. Data comes from AutoCAD, Maya, Revit, and Inventor. If I want to segregate their data, how do I do it? It's the same pattern, as the sketch below shows.
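Here is that character count written as a small Spark job in Scala, just as a sketch of the map, shuffle, and reduce steps described above:

```scala
import org.apache.spark.sql.SparkSession

object CharCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("char-count").getOrCreate()
    val sc = spark.sparkContext

    // Three input lines, spread across three partitions (our "nodes")
    val lines = sc.parallelize(Seq("XBB", "CBA", "XAC"), numSlices = 3)

    val counts = lines
      .flatMap(_.toCharArray)   // split each line into characters
      .map(c => (c, 1))         // map: emit (character, 1)
      .reduceByKey(_ + _)       // shuffle by character, then reduce: sum the 1s

    // Prints (A,2), (B,3), (C,2), (X,2) in some order
    counts.collect().foreach(println)
    spark.stop()
  }
}
```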
We just partition it by product and count the number of whatever it is I want to filter on: count the number of licenses per product, say. It operates in exactly the same manner.

Okay, so that's MapReduce. Now, why Spark? MapReduce has been around for a long time. It started out with the technology we call Hadoop; some of you may have heard of it, it's the one with the elephant. When Hadoop came out, it was a huge jump in the processing time for big data, a breath of fresh air. But MapReduce has a lot of room for improvement. Although it's fast, people always iterate and come up with something better.

So why Spark? In traditional MapReduce, data sharing is very slow. Why? Because to process, read, and write the data, they have what they call the Hadoop file system, HDFS; that's how they're able to split work across different nodes, so they have to have their own file system. But to share that data as intermediate results, say you have one MapReduce followed by a second MapReduce to get to the final result, they have to do I/O: they have to write, to persist, the intermediate results to disk. That is why data sharing is very slow in traditional MapReduce. It goes something like this: the data is on disk, they load it into HDFS, they do the maps first, and then they need to write whatever they mapped to disk. If you look at this example, it needs to be written to disk before it can be reduced, and that causes I/O. It ends up spending around 90% of the time on serialization, replication, and I/O just because of this.

What Spark did is come out with the notion of shared memory: Spark processes data in memory, and that's what makes it 10 to 100 times faster. It introduced what they call resilient distributed datasets, or RDDs for short. In between steps, instead of writing to disk, you have a Spark context, and as long as you are in that Spark context, you can share the memory; that's what makes it fast. So Spark removes the expensive operations by introducing shareable memory, those resilient distributed datasets.

And this is how it then looks: we have an iteration of map first, and the memory is distributed and shared; then the second map, then the third map. Another thing about Spark: in Hadoop, every map requires a reduce, even when logically you don't require one; it's map-reduce, map-reduce, write to file. In Spark, you don't need a reduce every time. You can map, and map, and map, and then finally reduce at the end (I'll show a tiny sketch of this in a moment). That's also an optimization. Spark has a lot more optimizations explaining why it's so much faster than Hadoop; right now we're at Spark 2.4, with 3.0 in beta, and there are a lot more, but I'm just covering the biggest advantage over Hadoop and why it's 10 to 100 times faster.

So this is why we need a big data technology. Because of the volume of data we have, it's very expensive and sometimes impossible to bring the data to the processing. We have to bring the processing to the data, and that's why we need the big data technology. So next: why do we need a scheduler? Why not just cron?
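Before getting to the scheduler question, here is the tiny sketch of that map, map, map, then reduce chaining, with nothing persisted to disk in between. The dataset and field format are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object ChainedMaps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("chained-maps").getOrCreate()
    val sc = spark.sparkContext

    // Toy "product:count" records
    val raw = sc.parallelize(Seq("autocad:3", "maya:5", "autocad:2"))

    val totals = raw
      .map(_.split(":"))                         // map 1: tokenize
      .map(t => (t(0), t(1).toInt))              // map 2: to (product, n)
      .map { case (p, n) => (p.toUpperCase, n) } // map 3: normalize the key
      .reduceByKey(_ + _)                        // one reduce, at the very end

    // Intermediate RDDs lived in memory; nothing was written to disk between maps
    totals.collect().foreach(println)            // (AUTOCAD,5), (MAYA,5)
    spark.stop()
  }
}
```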
So again, I need to remind you guys: it's big data. And with big data, sometimes you cannot... actually, let me take that back: most of the time you cannot process in sequence. You have to parallelize. Say a start action forks into three actions, and those three actions further fork into three more actions. Using cron? Possible, yes, but is it the best way to do it?

So we have what we call a directed acyclic graph, a DAG. Honestly, it's just a fancy term; it's a conceptual representation. Break down the words: graph means a graphical representation; directed means single-directional, one way; and acyclic means there are no loops. So a directed acyclic graph is a single-directional flow with no loops, represented as a graph. It's most often used in data processing for a series of computations run on data to prepare it for one or more ultimate destinations, and there can be more than one path in the flow. An example of a directed acyclic graph looks like this. It's very small, but this is taken from our production system. Here you can see a fork; in this particular instance it's still running. Yellow means running: it hasn't succeeded or failed yet. For our DAGs, we use Oozie as the scheduler.

Okay, so enough theory. Let's put it all together and build a production big data pipeline. No, I don't have live demos; I'll just share some scripts and some code on how you do it, because in 30 or 45 minutes, honestly, we would not be able to cover all of that with a live demo.

Basically, first you create a MapReduce function. I know it's a little bit small on screen, but this is our production code as well. This one is cleaning the raw data, and we've specified three kinds of cleansing we want to do here: one is the valid logs, the second is invalid logs, and the third is corrupted. Corrupted is very simple: it's not JSON, we cannot parse it; our file format is JSON. Invalid means there's a field with a data type it's expecting, and the incoming data did not conform to that data type. For example, I'm expecting a string and they gave me a Boolean; it doesn't conform, so we have to weed it out.

The programming language used here is Scala, because Spark is natively built in Scala. Spark does support Java and Python as well, but we chose to write in Scala, nearest to the underlying technology that processes the data, so we don't pay extra overhead outside the JVM, basically. In the code, I'm creating a log path array, and in that array I've specified the three categories: valid, invalid, and corrupted. Then I do .par; par means parallelize, so I want to parallelize this execution of the .map. And .map, like what I showed you, just segregates each input into its category. So what this does is: given the inputs, it segregates them into valid, invalid, and corrupted, in parallel. Then validate is an internal function doing something else; let's forget about that. And finally, we do some counts to validate: if it passes, we say that the counts of the valid, corrupted, and invalid are correct and it's successful, and we go ahead and save the RDD. All saveRDD does is write to S3, nothing else. (A heavily simplified sketch of this kind of job follows below.)
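For a feel of the shape of such a cleansing job, here is that heavily simplified sketch. This is not our actual production code; the classification rules, S3 paths, and names are stand-ins:

```scala
import org.apache.spark.sql.SparkSession

object RawClean {
  // Stand-in classification; the real job parses JSON against the common schema.
  def classify(line: String): String =
    if (!line.startsWith("{") || !line.endsWith("}")) "corrupted" // not parseable as JSON
    else if (line.contains("\"count\":\"")) "invalid"             // wrong data type for a field
    else "valid"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("raw-clean").getOrCreate()
    val raw = spark.sparkContext.textFile("s3://adp-raw-data/logs/dt=2019-11-01/")

    // The "log path array" idea: three categories, processed in parallel with .par
    Seq("valid", "invalid", "corrupted").par.foreach { category =>
      raw.filter(line => classify(line) == category)
         .saveAsTextFile(s"s3://adp-processed-data/$category/dt=2019-11-01/") // the saveRDD step: write to S3
    }
    spark.stop()
  }
}
```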
And if the validation doesn't pass, it just raises an error, basically. Oh, and by the way, again here: rdd.par, a parallel map, save the RDD, so it saves to S3 in parallel. That is what makes it very fast. So that's a very simple example of a MapReduce function in Spark.

Then, after you've compiled it, it runs, you've verified and debugged it, it works fine, what you need to do now is schedule a workflow. Again, we're building a production data pipeline, and in the end I will tell you why we need to build it. So we schedule a workflow. This is the action: I tell it, do the raw clean for me. This is the configuration. And that's that; I schedule it every day. Oozie's syntax, don't ask me, is XML. Yes, it's outdated, but this is what it is.

Then, once it has run, of course you have to verify your data. I've redacted some data over there because I don't want to violate anything. Can you see it, or is the text very small? Oh, I'm so sorry about that; it looked fine on my big screen. But basically it's a SELECT from the table I was creating, with partition pruning on the date, that's it, simple, and a LIMIT of 20. This is the result it gave me, which means my job ran successfully and the data is now there in S3. And then, after everything, I create a visualization on top of it. This shows me how much raw data I got, how much is invalid, and how much is corrupted. We have a second part of the processing, which has abnormal and delayed and all these things, but basically this is a dashboard I've created.

Now, to answer the question: why do I need to build a big data pipeline? Because all of this has to be updated every day. When executives ask questions like how many licenses are being used, the questions are usually not one-time. They usually ask at a daily level: how many licenses do we have today? How many subscribers do we have today? Tomorrow? Yesterday? What's the trend? How do we compare? To establish that trend, and to populate a dashboard that updates daily, you need to build a production data pipeline that runs hourly, daily, or monthly, depending on the need. A one-time run scheduled on a cron doesn't work. And it's a little bit nuanced, but creating a production data pipeline requires a bit of an adjustment in mindset. It's not your regular programming (regular programming, that sounds like TV); it's not like your regular application development. It requires thinking about parameterization: parameterizing dates, parameterizing certain fields. It's not rocket science, but it does require a different kind of thinking.

I've reserved some room for Q&A, but I just want to let you know that I've put some resources here as well. You can take a look at Spark if you're interested, and at Oozie as well. The first two are basically about what big data is, and the third one is an introduction to Spark and why Spark is better than Hadoop. So that's the end of my presentation. I've reserved some room for Q&A. Yeah, go ahead.

So you mentioned at the beginning that you're moving everything from in-house to AWS. How worried are you about vendor lock-in? Great question. That's the first question I asked our architect.
So honestly, there are advantages to vendor lock-in. One: in an enterprise, we don't want a situation where what we're building in this VPC cannot talk to another VPC purely because they're from different vendors. The second one, honestly: if you take a calculated guess, AWS is not going to go away anytime soon, at least not in my lifetime, so there's very little worry around that. And Autodesk has actually taken it one step further and partnered with AWS. That's a good point I wanted to bring up. I mentioned previously that we built this in-house, and when we were partnering with AWS as they were coming out with these new services, their services could not handle our load and our use cases. So that partnership was really helpful for them, to ramp up their services, and at the same time it's helpful for us, because then it becomes managed. We don't want to manage it; our team of 10 engineers cannot battle with their teams of 100 engineers, right? So it's a win-win in that case. We've partnered with AWS, so that sense of fear, that sense of apprehension, it's not there at all. Yeah. Anyone else? Yes?

Okay, I know you said apart from managed, but I want to take a step back there, because when we say managed, it's not just that someone is managing it, scaling it up and down. It's also about SLAs. For the data we were collecting previously, just to give an example: when we were using Kafka, we were losing around 30% of our data, because the amount of data coming in was a lot bigger and a lot faster than what our Kafka could handle. So when we switched to AWS, it wasn't just about reliability and someone maintaining it for us; it was also about their SLAs and their capability to auto-scale up and down based on our needs. And the amount we pay for that managed service is pennies in comparison to having one or two people dedicated to maintaining that Kafka stream and scaling it up and down. At the same time, once we've lost data, yes, we can retroactively fix things, but there's no going back to collect that data again. So in terms of the benefit, it's not so much technical, and not so much "managed", as the promise AWS has given us that there's no data loss. And true enough, after we switched to AWS, we have 99.9% of our data. There's a 0.1% loss, but that is acceptable. Thank you for asking. Any other questions? What else?

This is a good question. It depends on what the product is. Autodesk has 120-plus products, most of them around manufacturing and construction; our biggest one, AutoCAD, is for both design and manufacturing. For AutoCAD, there is what we call telemetry data: every mouse movement is tracked, which is why there's a huge amount of data being collected. That's AutoCAD. In other products, say for structural analysis, every action the user takes can also trigger calls a thousand times over. So imagine one slight mouse movement generating a thousand calls; we get that data. And for AutoCAD as well, every mouse movement generates those thousands. That's why the data is huge. Yeah. It's mostly around how commands are used, and around different geographical locations. Basically, they've put it in their code: when you hit this, collect the data.
Thanks for asking. We have time, what else? I think we have time for maybe one more question, if you have any. All right, I guess that's it. Thank you, everyone.