Hello. Nice. I love this part. You know that part where you're in line for the ride and you've wandered all the way through all of the turnstiles and now you're next? We're next in line. We get to do this. Well, yeah, there is that. Yeah, it's watching the old people scream. Watching the kids scream is kind of interesting too. Welcome. Thanks for joining us. Okay, it's about that time. Today we're going to talk about Data Warehouse, Data Lake, Data Mesh, Oh My: the history of data storage. We're going to take a look at data mesh in particular, double-clicking into that, because, you know, I get asked a lot: is this just a trend? Should I, you know, skip this one and wait for the next one? Here's the part where I tell you I am definitely going to post the slides on my site tonight. If you've heard that from any other instructor — I actually spent six months chasing a speaker to try and get their slides, and by the time they actually replied, they said, I don't even do that talk anymore. Which is why the slides are online right now. Let's head over to robrich.org, and we'll click on Presentations here at the top, and here's Data Warehouse, Data Lake, Data Mesh, Oh My. The slides are online right now. Yes, I am very loud. I don't know how to turn down the volume here. Okay, maybe I can turn down. Does that turn down the volume a little bit? Cool. Yeah, that was a little loud. So the slides are there online right now. Head over to robrich.org and you can follow along and see the slides in real time, which would be really cool. While we're here on robrich.org, let's click on About Me and see some of the things that I've done recently. I'm a Jetpack developer advocate, and so if you're struggling to deploy to Kubernetes, I would love to learn with you. Some of the other things that I've done: I'm a Docker Captain and a Friend of Redgate. Microsoft has also given me some awards. One of the things I'm particularly proud of is AZGiveCamp. AZGiveCamp brings volunteer developers together with charities to build free software. We start building software Friday after work. Sunday afternoon, we deliver the completed software to the charities. Sleep is optional, caffeine provided. If you're in Phoenix, come join us for the next AZGiveCamp. Or if you'd like a GiveCamp here in LA or wherever you traveled from, find me here at the conference or hit me up on email or Twitter, and let's get a GiveCamp in your neighborhood too. Some of the other things that I've done: I do quite a bit with data, and in particular data automation. SQL Source Control Basics, that was a lot of fun to write. I got to write a chapter in there. That was really fun. One of the things that I'm particularly proud of is that I replied to a .NET Rocks podcast episode. They read my comment on the air and they sent me a mug. So there's my claim to fame, my coveted .NET Rocks mug. And if you'd like a .NET Rocks mug, now you know how to get one. So let's talk about data warehouse, data lake, data mesh. We talked about this guy. So is data mesh a fad? Should I just skip this one and wait for the next one? I haven't really amortized the cost of my data warehouse after all, or my data lake, so I'll just let this one slide. That's what I often hear, and that's part of what inspired this talk. We're going to take a look at lots of different data storage technologies: database, data warehouse, data lake, and data mesh.
Then we'll double-click on data mesh and kind of compare and contrast it to other philosophies and see if this one is just a fad or if we really want to dig in here. Let's start at the database. And actually, before this talk, it was really cool, someone suggested I should start even earlier. There are definitely data storage technologies that came before we got to digital. The card catalog in a library is a great example. We could also take a look at, you know, carved stones and stuff. But let's start at the digital age. We'll start with the database. So we have a database. We want to store some data. Maybe this is the late 80s, early 90s. We're, you know, within an organization, and we can probably build a database, one monolithic database, mirroring the one monolithic application. Maybe we have a user, a handful of users. And it's actually pretty elegant. Here's our data diagram. We have a user that uses the application that uses the database. What's our security model? Well, the building. It's probably in the back room somewhere. I actually worked at a company where all the non-people things were in one room: the server, the water main, all the telco stuff. That was an interesting game. But yeah, the building security is pretty much the boundary of this data store. And that's awesome. Our application evolved slowly, so our database evolved slowly. And it worked. In time, we got to this thing called SQL, Structured Query Language. It's always fun when I hear people pronounce it "ess-cue-ell." I'm like, there are only two groups of people that pronounce it "ess-cue-ell": the people who have only seen it written and never heard it, and the people who invented it, who explain it like this. So, yeah. Thank you to Donald Chamberlin and Raymond Boyce for building SQL. SQL is this mechanism where we have kind of an English explanation of the types of things that we're querying. Now, we're going to end up with a data store that is consistent. All of the rows have the same type of data in them. So if I have a column that is a phone number, then everything in there is going to be a phone number. If I have it as an age, then everything in there is going to be a number. Now, that's great. We have SQL. We can start to normalize our data. We can think of this like Excel spreadsheets. If I have an Excel spreadsheet showing a customer, well, how do I show the orders that they bought? Or if I have an Excel spreadsheet for an order, how do I show the products on that order? So we'll have related tables, related sheets, and each table then has a foreign key back to the original source of data. Our order line will have a foreign key back to our order, and our order will have a foreign key back to our customer. Now, we've chosen to normalize our data in this realm, both to avoid duplicating data, but also to store the data in a very compact way. In this time, data storage is at a premium. We're talking about, you know, thousands of dollars per megabyte. I have a friend who talks about data storage as how much data can I fit up my nose? In this realm, it was, you know, pretty much nothing. We normalized our data both to avoid redundancy and to store data more compactly. So then we had to do joins between tables to rebuild our data set into a complete set of data.
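As a quick sketch of that normalized schema and the joins it requires — table and column names are invented, and SQLite stands in for the era's database so it runs anywhere:

```python
# Minimal sketch of a normalized customer -> order -> order_line schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign keys described above

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    phone       TEXT
);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
CREATE TABLE order_line (
    order_line_id INTEGER PRIMARY KEY,
    order_id      INTEGER NOT NULL REFERENCES "order"(order_id),
    product       TEXT NOT NULL,
    quantity      INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada', '555-0100')")
conn.execute('INSERT INTO "order" VALUES (10, 1)')
conn.execute("INSERT INTO order_line VALUES (100, 10, 'Widget', 3)")

# Rebuilding the complete picture requires joining back up the foreign keys.
rows = conn.execute("""
    SELECT c.name, o.order_id, ol.product, ol.quantity
    FROM order_line ol
    JOIN "order" o  ON o.order_id = ol.order_id
    JOIN customer c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)  # [('Ada', 10, 'Widget', 3)]
```

Every read that wants the full picture pays for those joins, which is exactly the trade-off the later experiments poke at.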
ACID compliance. One of the big things about SQL is ACID. So let's double-click into that, and we can take a look at ACID compliance. We know that given these properties of ACID compliance, the entire transaction will either succeed or it will fail. We won't get some of the data in the database and some of it not committed. If we're going to update the data, all of it will get in or all of it will fail. Now, that's really elegant. We can say, hey, given this constraint, I know that my data is consistent. By comparison, we can compare this to eventual consistency. I walk into Starbucks and I place my order. They're not going to hand me the cup and then pour stuff into it and then put the cream on — I don't know how coffee works, so, you know, there is that — and then finally put the lid on it. Instead, they'll give me a token, or I'll give them a token, in this case my name. And once the process is done, then they'll call me and deliver the finished product. We probably do similar things on e-commerce sites. When I go to Amazon and I check out, or if I'm booking a flight on an airline, when I check out, I don't have the ticket yet. They'll email me that in a few minutes or an hour or two, once they have that final ticket number. But I'll give them a token or they'll give me a token, and once it's done, we'll get it resolved. Once I finish that purchase, if I go straight over to my credit card and I look for my account balance, that charge may not be on there. But by the end of the month, all of the charges will be listed and I'll be able to reconcile that. So, ACID compliance versus eventual consistency. Now, we may have different needs for different problems, but the cool part about SQL is that we have that ACID compliance. We have ACID compliance, we have normalized data, we have this strong query language, and so it's really cool. We have this database that kind of mirrors the monolithic structure of our application at this point. So, on the upside, we have SQL. Security is easy because it's the boundary of the building. Changes are rare, so we don't need to evolve the schema very much. We have a strong schema. We focus on small storage to preserve our limited storage resource. It's working out pretty good. And a few years later... so, now what? Well, let's build some reports. Analytics. Now, the cool part about analytics is we're able to take a look at lots of data and infer trends or calculate totals. That's pretty cool. But it is also a very different characteristic from the normal details of our database. We're querying in bulk. So our data map may kind of look like this. We have a user using a data entry app. They're finding a particular record, adding or updating that particular record. It's very transactional. It's very small amounts of work. By comparison, we have our analytics users using BI tools, perhaps, and they're doing these bulk reads and no writes against our database. Now, in this era, maybe they're the same user, maybe the reports aren't as intense as we're making them out to be here, but these reports do have a very different query shape from the query shape of our transactional users. So, yeah, we're doing bulk reads and no writes. That's very different from our transactional users that are doing a seek and updating very small amounts of data. So when we're doing these analytics operations, we're doing these bulk reads, and so we're probably walking the table, maybe for a significant amount of time, while we read all that data. Now, we definitely could read uncommitted, so, you know, not lock while we walk the table, but now we've lost our ACID compliance.
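That all-or-nothing behavior we'd be giving up by reading uncommitted looks roughly like this in code — a minimal sketch with an invented account table:

```python
# Both updates commit together, or neither does: the whole transfer rolls back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])

try:
    with conn:  # one transaction
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")  # violates CHECK
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the entire transaction was rolled back

print(conn.execute("SELECT id, balance FROM account").fetchall())  # [(1, 100), (2, 0)]
```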
So if we read uncommitted, is that an accurate total for the month? Or were there transactions coming in that we got part of, but not the other part of? These are concerns that hit us. And so, yeah, let's walk the table, let's run the report, and now we're starting to impact our transactional users because, well, the table's locked, so now the application is slowing down. Hmm. So what can we do here? On the upside, we have a great strong schema. We're storing it in a very small space. On the downside, because we now have these analytical reports coming out of it, we're locking the tables and we're yielding poor performance for our applications. So analytics locks the table, kind of breaking transactional needs. Can we, like, get rid of those report users? Push them off to the side, maybe? So, how do we make analytics faster? Make the report run faster. That'll do it. Then we aren't locking the table for as long. Let's, you know, add indexes. Let's add statistics. That'll make our queries perform faster, except now all of the writes take longer, because we're kind of pre-doing those reports as we update each index or each statistic on the way through. So what if we separate our data stores? Let's do online analytical processing, or OLAP, by comparison to online transaction processing, or OLTP. So, OLAP. The theory behind OLAP is: let's separate the transactional workloads from the analytical workloads. If we store them in different databases, we might be able to tune them for different workloads, to, you know, avoid that impact on our transactional users when we do these analytical queries. So in our transactional system, we'll seek for a row and then either add or update that one row. By comparison, in an analytical system, we're doing bulk reads of data and we're not writing at all. Those are very different query shapes. So let's separate it and do this. Now, we still have our user using our data entry app, hitting our regular database. We'll call this the OLTP data store. And we also have a user using analytical tools, hitting an analytical database over here, the OLAP database. Now, that's awesome, but now we need to get the data from here to there. So let's do this ETL process: extract, transform, and load. Or maybe we do an extract and load and then transform in place, so an ELT process. We have a new piece: how do we get the data out of here and into there? That's pretty cool. How do we get that data? That ETL process. So, ETL, extract, transform, and load: that process of bulk reading all of the things from the transactional data store and updating them in the analytical data store so that we don't need to do that every time we run a query. I just said that again. That's cool. So, ETL: periodically we're going to read all of the changes, or maybe all of the data. We're going to suck that out of the transactional system and write it to the analytical system. Yes, it's going to lock. But the locks involved in doing this one ETL process are much less than the locks for running every query, every report, inside of our analytical system. So once a night, once an hour, I don't know, let's go find all of the changes and stick them all in that other data store.
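That scheduled ETL job might look roughly like this — a minimal sketch with invented table names and an assumed updated_at column for change detection:

```python
# Periodically pull rows changed since the last run out of the transactional
# store and load them into the analytical copy.
import sqlite3
from datetime import datetime, timezone

# Hypothetical stores; in real life these would be two separate servers.
oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")
oltp.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, updated_at TEXT);
INSERT INTO orders VALUES (1, 7, 19.99, '2024-01-02T03:04:05Z');
""")
olap.execute("CREATE TABLE orders_analytics (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, updated_at TEXT)")

def run_etl(last_run_utc: str) -> str:
    """Extract rows changed since the last run, load them into the analytical copy."""
    started = datetime.now(timezone.utc).isoformat()
    # Extract: one bulk read against the transactional store.
    changed = oltp.execute(
        "SELECT order_id, customer_id, total, updated_at FROM orders WHERE updated_at > ?",
        (last_run_utc,),
    ).fetchall()
    # (Transform would go here: denormalize, rename columns, and so on.)
    # Load: upsert into the analytical store.
    with olap:
        olap.executemany("INSERT OR REPLACE INTO orders_analytics VALUES (?, ?, ?, ?)", changed)
    return started  # becomes last_run_utc for the next scheduled run

run_etl("1970-01-01T00:00:00Z")
print(olap.execute("SELECT * FROM orders_analytics").fetchall())
```

How often you schedule run_etl is the whole dilemma that comes next: run it often and you lock the transactional store often; run it rarely and the analytical copy goes stale.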
Now we end up with this wonderful ETL dilemma. Yes, the ETL process is going to lock the transactional store, and while it locks the transactional store, our users are impacted. So let's do the ETL process less frequently. Therefore we're not impacting normal users; they can accomplish their tasks faster. But now the reports are stale, because if we only run the ETL process once an hour — once a day, rather — then the best we can do is the report as of last night. So let's run it once an hour instead. The faster we can run this ETL process, the fresher the reports are, so let's run it maybe once a minute. Hey, now we're locking those tables pretty substantially, and now our users are starting to complain about performance. So let's turn it down. We have this dilemma. If we ETL more frequently, we impact our users. If we ETL less frequently, we have stale data on our reports. Is there a solution to that? So, OLAP databases. We separated our transactional system from our analytical system. Now, because they're separate, we have a few times where we do these bulk reads in our transactional system, but that's a whole lot lighter weight than trying to do those bulk reads every time we want to run a report. These data stores are now separate, and so our analytical users can run all the reports that they need to, based on those query shapes, and our transactional users can go find one row and add or update it really easily. That's cool. We ended up with this ETL scheduling dilemma, and that's kind of a bummer. Should we ETL faster or slower? Do we want to impact our users more, or do we want to make our reports fresher? So, OLAP databases. It requires that we do this ETL process, and the ETL process is by definition a compromise. So let's see what we can do with that. OLAP databases: let's rename them. Let's tune the analytical database for queries. Let's focus on making this analytical database really centered on that querying responsibility, and let's call it a data warehouse. Now, what's the difference between an OLAP database and a data warehouse? The name? But now that we've deemed it completely separate, let's start these experiments of making it more tuned to the bulk-read type of scenario. So, now that we have them separate, we can modify the analytical system without impacting our transactional users. We still have the same data network diagram. We have a transactional system that facilitates our transactional users, and we have an analytical system that facilitates our BI users, and we have this ETL process that gets data from our database to our data warehouse. It runs periodically, but we're going to tune our data warehouse to make it more geared towards this analytical problem. So let's dig into some of the experiments we might do in this analytical system. How about we remove foreign keys? That's painful on all kinds of levels. But we're not writing in the analytical system. The purpose of a foreign key is to validate, during a write operation, that that other data exists. In an analytical data store, either the other data already exists or it doesn't. That decision has been made. So this foreign key is just making the write operation into the analytical data store slower. If we remove the foreign keys only in our analytical system, then we can load that data a little bit faster and there's less overhead in our database. Yes, the foreign keys still exist in our transactional system. Don't remove them there. We're only going to remove them here in our analytical system. Next up, let's create different indexes. Now, in our transactional system, we probably needed indexes geared towards the screens within our application.
We need to be able to load a list of customers and we need these particular fields, because those are the ones that are shown in the Select a Customer dialog. We need to be able to grab all of the customers, or maybe we need to grab both the order and the order details at once, and we'll always do that by order number. So we've tuned these indexes and these statistics to match those use cases. By comparison, the use cases inside of an analytical system might be drastically different. We're going to sum orders by region. We didn't have an index on region before. We didn't have an index on salesperson before. So let's create these different indexes to facilitate that analytical workload. Yeah, they can be different. That's kind of cool. Next up, let's denormalize the data. Uh-oh. This is really uncomfortable. Wait, why did we normalize the data? We normalized the data for two reasons. The first was to avoid data duplication. If we're going to write the data, we want to make sure that we're only writing it in one spot. And the second was because storage was expensive, so we wanted to store it smaller. Our problem today is not data storage capacity. Our problem today is data latency. So we can store it bigger. My friend who talks about, you know, how much data storage can I fit in my nose? He got a petabyte in his nose once, and I'm like, that is cool. Yeah, data storage is not our problem anymore. Data latency is our problem. So we're going to denormalize the data, understanding that that means that if we write data, we may need to write it into multiple places. But now we can read data a whole lot more easily, because we have the related data all in one place. In our transactional system, it is still very much normalized. But in our analytical system, maybe it's denormalized a little bit to make those reads faster, to make the reports run faster. So, why did we normalize the database? Because we wanted to fit in limited disk space, and that's not our problem anymore. Yeah, I already walked through that. Our problem is not data storage. Our problem is latency. So, here's a normalized database. We have the customer, we have the orders, and we have the customer ID foreign key back to this table. In our transactional system, we definitely want that, because we want to be able to update the customer name in one place and have it take effect across all orders everywhere. But in our analytical system, maybe we're going to put the customer name as a column inside of our order table. Now we don't need to join to the customer table to get at that data. Now, it makes sense for some columns. It doesn't make sense for other columns. But if we duplicate the data to avoid the joins, now we can process these reports faster. It does mean, as we're doing the ETL process, we do need to publish all of the updates to all of the places where that data is written. But that update process happens infrequently. Our reports happen much more frequently, and the latency involved in the reports is much more important than the latency involved in the ETL process. So let's denormalize our data. Now, this is an experiment. If it doesn't work, we can definitely go back. But it's one way that we can start to speed up our analytical systems.
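Concretely, the denormalized analytical copy might look something like this — a sketch with invented tables, where customer_name rides along on every order row so the report needs no join:

```python
# Denormalized analytical table: customer_name is duplicated onto every order
# during ETL, and an index on region (something the transactional system never
# needed) makes the bulk read cheap.
import sqlite3

olap = sqlite3.connect(":memory:")
olap.executescript("""
CREATE TABLE orders_analytics (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER,
    customer_name TEXT,   -- copied from the customer table during ETL
    region        TEXT,
    total         REAL
);
INSERT INTO orders_analytics VALUES
    (1, 7, 'Ada',   'West', 19.99),
    (2, 7, 'Ada',   'West', 45.00),
    (3, 9, 'Grace', 'East', 12.50);
CREATE INDEX ix_orders_region ON orders_analytics(region);
""")

# The report reads one table: no join back to customer.
print(olap.execute(
    "SELECT region, customer_name, SUM(total) "
    "FROM orders_analytics GROUP BY region, customer_name"
).fetchall())
```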
Next up, let's experiment with storage types. Now, SQL was great. SQL gave us this optimized query engine. It gave us ACID compliance. But we can start to experiment with database engine types, maybe to optimize for the particular workload that we're after. Maybe we need a graph database to keep track of relations. Or maybe we need an event database to keep track of very high-throughput, event-based data. Or maybe we need a NoSQL database, a document store. And one that's particularly interesting is a column store. Here's a row store, where we take a look at a traditional SQL database. And here's a column store. In a column store, we can think of this kind of like the row store's index. Imagine that I have countries in the world or states in the United States. If I have states, I can probably make this index pretty small. There are only 50 of them. If I'm doing countries in the world, there are only 200-something of them. So now, if I want to bulk read based on countries in the world, I can quickly seek to that particular spot and I can bulk read all of the data associated with it. That's cool. So if we store our data in columns, then maybe we can do those bulk read operations a little more easily. Now, columns make sense when there's a high correlation in the data: the countries in the world or the states in the United States. But it doesn't make sense for something like phone number. There probably aren't going to be very many duplicates of a phone number. In fact, if I find a case where a phone number is duplicated, that might actually be a data error. I don't want to create a column store index on phone number. I want to create a column store index on salesperson or product type or region. So playing with these different types of data stores is really interesting. Let's also take a look at the SQL and NoSQL experience. Here in SQL, I have the SQL query language and I'm digging into the customer table, and that works out really well. But one of the expensive pieces inside of a SQL query is joining to that other table to get that related data. By comparison, in a NoSQL database, I can have an array inside my document, and that array can keep track of all of that foreign-key data. Maybe I want to store the order together with all of the order's lines in one document. Now, when I want to go read that order, I just read that one document and I have all of the things in one spot. I don't need to do joins. Now, maybe this doesn't make sense for everything, but it does make sense when I have, say, a news site where I want to be able to hit that slug, load the page really quickly, and get all of the related data. So I'll put in the article, the article title, the author, the author's name, probably the author's bio details and their image, maybe even the first few comments. All of that will get stored in that one document together, so that when I hit that document, I can read it in one shot and display it on the page. So we did some experiments with data storage types, and we did some experiments with denormalization and with foreign keys. Let's do another experiment with horizontal scaling. Now, SQL databases were built in the realm where we pretty much had one monolithic machine. Now, that may still make sense in some scenarios. If I have an auto-incrementing ID, I can't exactly distribute that across lots of machines; they might get collisions. But in this read-only scenario, I don't need to keep track of a singular sequence like that. So what if I distribute my data across horizontal nodes? Now, am I sharding my database, or am I duplicating it into multiple pieces? Ultimately, we can do those experiments, but we can do more horizontal scaling, create more copies of these machines, so that we can get at our data a little more easily.
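One very simplified version of that horizontal-scaling experiment, with invented node names — route each key to a node by hashing, whether the nodes are shards or full read-only copies:

```python
# Simple hash sharding: the same key always lands on the same analytical node.
import hashlib

NODES = ["analytics-node-0", "analytics-node-1", "analytics-node-2"]

def node_for(key: str) -> str:
    """Pick a node deterministically from the key."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("region:West"))   # always the same node for this key
print(node_for("region:East"))   # possibly a different node
```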
Now, this is an experiment. Maybe we get to the end of that experiment and we didn't speed up data access. But if network latency was a big thing, or CPU contention was a big thing, maybe horizontal scaling will get us to the right spot. As we start looking at horizontal scaling, we hit right up against the CAP theorem. The CAP theorem is really interesting. We can have consistency, availability, or partition tolerance. You can pick two, or you can pick one really hard, but there is no overlap for all three. Looking at this as a Venn diagram, we really want to be right there, except that isn't a thing. Now, what's interesting about the CAP theorem: we're looking at availability, consistency, and partition tolerance. Let's look at this in the context of analytics. Consistency: we're not really writing to this data store. We're doing read-only reporting out of this data store. So consistency probably isn't a big need for us. Similarly, partition tolerance: is it important if this partition gets out of sync with that partition? It's a read-only data set, and it doesn't change other than through that ETL process. So partition tolerance really isn't that important. With our reporting systems, availability is the important piece. We want to be able to run these reports quickly, and we want to have these reports available. So we're going to lean on availability here in the CAP theorem and not focus so much on the other two. Now, that is a trade-off. That does mean during that ETL process we might get into some spots where data is inconsistent, or we might end up with split-brain. But we're going to say that availability is more important, so we're going to understand that that's okay. So we did these experiments: remove foreign keys, create different indexes, denormalize the data. We looked at other storage mechanisms, and we started playing with horizontal scaling. Now, all of these experiments are inside of our analytical system. Our transactional system is unchanged. But because our analytical system has different query shapes, we can do these types of experiments and yield maybe really effective results for the reports, for the bulk reads that we're trying to do here. That was cool. Our data warehouse is looking pretty awesome. We did some denormalization to make our analytics fast. Our schemas can evolve differently. So now our transactional system and our analytical system, the schemas may be significantly different because of that denormalization effort or because of a different data engine that we're using. Cool, let's stop. We solved it, right? Well, we have all of these different data warehouses for different teams within our organization. How do we combine the data between those teams? I want an organization-wide view of all of the different things within all of the different data warehouses. So, yeah, we ended up with lots of little data warehouses, one for each team, optimized for that particular use case, and that worked out really well. Their queries are doing great. But now how do I join this data warehouse with that data warehouse to get holistic organizational understanding? Even more so, how do I discover each of those silos? What do you call a customer ID? Can I use your customer ID and join it to their customer ID? Probably not. Is the account number the same? Maybe. Is the name the same? Probably not. How do we create these correlations now between these various silos? And that's the next problem that we want to take on.
So on the upside, we were able to tune our queries to optimize for our particular use cases. That was great. On the downside, it was now difficult to join these between teams, because each data warehouse was a silo. So maybe we could ETL between data warehouses. You ever get halfway through a problem and you're pretty sure that this was a good idea when you began, and you get out in the middle and you're like, wow, this was a really bad idea? No, let's not ETL between our data warehouses. Let's change the battery in our things, in our clicker. So we're not going to ETL between our data warehouses. We're going to form a new concept called a data lake. Now, what is a data lake? A data lake will consolidate the data across the organization into one giant thing that we can then query across. We'll focus on high availability. We'll focus on speed and performance of ingress and reporting. And we'll get to that spot where we have this really cool lake of available data products. So we end up with a diagram like this. Yeah, each of these teams was able to do their processes, but they each ended up with a separate data warehouse. So we're going to pool all of those data warehouses together here into this data lake. Now, here in the data lake we'll have different query engines. We'll have different query shapes. People may be storing data in flat files or in different vendors' things. That's okay. But we're going to have the ETL process to get data into the data lake from each team. And we'll have a ninja squad that manages this data lake so that we can build something like a data catalog. Now, this data catalog can kind of keep track of what's here in the data lake, so that as analytical users approach this with their BI tools, we can understand what's in the store and understand where we need to go. Hey, that's pretty cool. We might have different vendors, different schemas, different data in this thing, but it's very highly available. That's excellent. Now we can query across all the data in the enterprise. So let's coin a term like data mart. Now, a data mart might combine unique views across different databases within our data lake to build unique views of this data. Now, because it's within the data lake, we have our ninja team keeping it highly available and keeping the data moving quickly. So we end up with SLAs on performance and availability. That's cool. I like the ninja squad being able to tweak the performance of that and get it just perfect. So the specialized team ensures performance and availability, that the ETL jobs run correctly and get the data into place. Kind of. So how do I get my team's data into the data lake? The ninja squad ends up now with a backlog of projects trying to get into the data lake. We end up with this contention between teams as we're trying to get at this limited resource that is the process of getting into the data lake. Not only that, the ninja squad knows nothing of the domain knowledge associated with my data. Their job is to make it performant and highly available, not to make it, I don't know, relevant. So we end up with a data lake full of really highly available goo. Our data lake has become a data swamp. It's highly available, it's very performant, but it doesn't mean anything. What is the customer ID? It's the ID of the customer. What is the correlation ID? It's the ID that correlates things. We end up with these types of descriptions inside of our data product catalog, and that ends up not really giving us information.
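The difference between swamp-grade metadata and a catalog entry that carries real domain context might look like the gap between these two entries — fields and values invented for illustration:

```python
# A toy data-catalog entry. The point is who owns the data and what the
# columns actually mean, not the particular fields chosen here.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    owner_team: str                       # the domain experts, not the ninja squad
    columns: dict = field(default_factory=dict)
    freshness_sla: str = "unspecified"

swamp_entry = CatalogEntry(
    name="customer_id",
    description="The ID of the customer",   # technically true, practically useless
    owner_team="data-lake-ninja-squad",
)

useful_entry = CatalogEntry(
    name="orders.customer_id",
    description="Billing account that placed the order; joins to crm.accounts.account_id",
    owner_team="order-management",
    columns={"customer_id": "CRM account number, not the loyalty-program ID"},
    freshness_sla="loaded nightly by 02:00 UTC",
)
```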
We have a very leaky abstraction. We're trying to get to the point where we can query across teams, but we still have to get beyond the ninja squad to the domain experts to try and get contextually relevant details inside of our data lake. It's awesome, it's highly available, but we have this contention trying to get new data sources into the data lake, and we have this data quality issue where the people maintaining the data aren't the people who understand the data. So it's highly available and discoverable, but it doesn't make any sense. Is that your data lake? Yeah. So a few years later we're like, okay, so now what? Well, how do we avoid the swamp? How do we add that contextual relevance back into our data? How do we avoid the bottleneck that is the ninja squad keeping the system highly available and performant? Now, we chose to consolidate our data to get at that performance and that availability, but we compromised domain knowledge and quality as we did so. Was that trade-off worth it? Maybe. Maybe not. So let's coin a new methodology. Let's call this data mesh. Now, when we talk about service mesh, we get into a very different type of connotation, so it's unfortunate that it's named data mesh, but it is data mesh. Let's take a look at data mesh. Now, with data mesh, we have a consolidated product catalog and we have distributed data nodes. That's pretty cool. Let's take a look at our architecture diagram. We have a user that uses a data entry app that gets at that transactional system. We have this ETL process that gets into a data mesh node. Now, this team is going to own that entire process, getting the content into their data mesh node. This other team does a similar thing, getting their data into this data mesh node. And here we then have that data mesh, that is, all of these data mesh nodes together with a centralized product catalog, so that we can understand the details associated with this data. We might choose to build our data products, we might have some data marts on top of that, maybe harvesting unique insights into one or more data stores. This, then, is distributed data storage with a central data catalog. Now, what about the data warehouse? I didn't finish amortizing it. It's, you know, this collection of things. It's a node in the data lake. Now, in time we may choose to distribute it back to the teams that owned it, or we may just leave it as a data lake there. This is one of the nodes inside of our data mesh. As our BI users hit it, they can understand the details in the product catalog, and they can pick the data mesh node that they need to identify the reports that they need to run. That's pretty cool. So, our principles of data mesh: we have distributed data storage and a central data catalog. We're going to present these as data products with the same SLAs that we had before around availability and performance. But now, hopefully, we can also get SLAs around data quality. The people maintaining the data node are the domain experts who maintain the transactional system, so data quality can be part of that data node as well. So how do we think differently here? Oh, the domain experts who maintain the OLTP system maintain that data node to ensure data quality. I like the analogy of a grocery store to describe a data mesh. I'm going to walk into the grocery store. I'm going to go to the bread aisle, and I'm going to be able to compare the various labels on the various types of bread to pick one.
Now, the grocery store did not make any of the bread — or maybe they did make a little bit, but not very much. But there are lots of brands for me as a consumer to pick from. Now, I can compare the labels, I can compare the facts around them, I can choose the bread that exactly matches my needs, and that's the bread that I'm going to purchase. Now, where this analogy breaks down a little bit is that it kind of describes a data warehouse: centralized. But here in a data mesh, what I'm identifying as I compare the various breads is the address of the grocery store, or the address of the bakery. So I'm in the grocery store, I pick my bread, and then I drive to the bakery to go get it. Yeah, the analogy isn't perfect. But it's a cool analogy. I can walk into the bread aisle, I can compare those labels, I can look through the data catalog, I can understand the choices that I have available, I can compare and contrast them, and choose the data source that exactly matches my needs. Then I can go to that distributed data source and get that data in a highly available way. I have data products based on a centralized data catalog, but those products are distributed. That's cool. So, our data mesh. Now, unlike the data warehouse, we have distributed data. The people who maintain the transactional system are the people who maintain this data mesh node, so we can focus on quality. The consistency and availability should match the data lake. Now, we may need to make some staffing choices to make that happen. Let's take the ninja squad and distribute them across the teams, or maybe we'll let them parachute into a team to help them get that node going. But there is a little bit of duplication of effort now, because each team is standing up their own data mesh node. So issues around availability and performance are a team's responsibility, and we need to form a center of excellence to share those skills and knowledge across the organization. We also have the ninja squad that is no longer a team but rather a set of consulting experts to other teams. It's easy now to enroll new pieces into the product catalog, because we just need to stand up a node; we don't need the ninja squad to prioritize it. But if a node doesn't get added to the product catalog, is it part of the data mesh? It might be possible to lose track of some of the silos within the organization because they didn't know how to enroll in the data catalog. So maybe they're part of the mesh, but they're not discoverable. So that's interesting. We have a very different paradigm from the data lake. In the data lake, all of the data was centralized. In data mesh, the product catalog is centralized but the data is distributed. That's cool. Now, one of the interesting things is that data mesh is not a technology. I'm not going to buy a data mesh. Data mesh is a methodology. Well, that's kind of interesting. Let's compare this to other methodologies that we've found in tech. Let's compare data mesh and microservices. Now, why did we move towards microservices? The monolith was kind of interesting. We could deploy it all at once. We could deploy it infrequently. We moved towards microservices so that we could distribute that problem and deploy small pieces as quickly as we needed to. By comparison, our data lake was this monolith, and we've chosen to distribute these microservices, these data mesh nodes, to focus on agility of getting data into the mesh. As a similar comparison, we can take a look at data mesh and DevOps.
Now, in DevOps, we pooled lots of different specialties to get at the particular needs of that DevOps pipeline. Let's pull in the software developers. Let's pull in operations. Let's pull in security. Let's get them testing. Let's get them into the same room. Let's let them automate their processes. It's unfortunate now that we have "DevOps engineers," because it kind of breaks that paradigm, but DevOps as a methodology was a great cross-functional way to get lots of people in the room to optimize this process, to deliver higher quality, and to deliver more quickly. Similarly with data mesh, we pool all of the different experts. We have the domain experts from the team. We have the ninja squad that is able to comment on SLAs around performance and availability. And we can pool these users, perhaps even with the BI users, to understand the needs of our data mesh node, so that we can denormalize our data effectively and so that we can make those trade-offs with ETL around availability and freshness of data. We have this cross-functional team that comes together to make this possible, really similar to DevOps. So what about the data lake? Do I have to throw away my data lake because I've chosen data mesh? No. The data lake is one of the nodes inside the mesh. Now, in time, we may choose to distribute the data lake back to the teams that owned it, let them own the ETL process, and now we avoid the ninja squad caring for that centralized spot. But in the meantime, that data lake is doing just fine. Let's make it a node inside the data mesh, and if we can get out of data swamp mode, then maybe it's just one of the nodes there in the data mesh, and we can just let it be. We don't need to abandon our investment in the data lake just because we want to move to this distributed methodology of data mesh. So should I dump my tech and start over? I'm halfway through my startup experience. Should I just flush the data lake and the data warehouse and start over? No. This is just a methodology. This is a way of thinking about this data. We don't need to buy a data mesh. We don't need to fund a data mesh team. Maybe we're taking that ninja squad and distributing it across the teams, but probably that's the only major functional change within our organization. This is a methodology, not a product. So in answer to my friend who asked me, should I skip this one and go to the next one? I don't think so. Let's take a look at other "fad" technologies. The Internet. Yeah, I don't think that's going to be a thing. Agile. Now, unfortunately, the marketing people have gotten hold of Agile, and so maybe the term itself is less meaningful, but there's that mechanism of people over process, of people over tools. DevOps, that process of getting cross-functional teams together to coordinate to deliver more quickly. I would argue that data mesh joins these as one of the methodologies of how modern software is built. Data mesh. We can embrace the methodology of data mesh to discover data across the organization, to easily publish new products, and to facilitate the domain experts who are working on that transactional data to ensure data quality. Unlike the data lake, where only the ninja squad could do it and they were the gatekeepers, with data mesh we can do this cross-functional process to get lots of people into place.
These slides are available right now on robrich.org, and if you find questions later or tomorrow, hit me up on Twitter at rob underscore rich, and let's continue the conversation. For those of you live at the event, what are your thoughts? What are your questions? Yes, you're right. As we start to distribute these back into their teams, maybe that's not the best way to consume unique data products. Maybe we do need to be able to look across teams to get the insights that we need. So we might want to build new views over the top of those distributed data products to get those unique data products. I like that. Yes. Good question. So how do we get data mesh into an organization where they have a very established data warehouse or data lake methodology? Yes, it's hard to get it started, but in a similar vein, the first unit test is always the hardest, because you're proving out that process. You're trying to get the DevOps pipeline to call into it. The moment that you have one, getting two is that much easier. Getting three is even easier than that. I would argue: can you get a champion above and a champion below to make that happen for maybe a smaller-risk project? Let's not add this to the data lake. Let's instead make this separate but still publish it to the data catalog. If we can get a champion above and below to make that happen, and it's a lower-risk project and it succeeds, maybe that can help us start the flywheel that gets it going farther. Ultimately, if our organization says, don't bother with the unit test, we're not shipping that code — yeah, that's a difficult fight. Great question, thank you. Yes, good question. So how do I get the ETL process into each node? The ETL might be a little bit different, but how do I get common learnings or standardization across the organization? What's interesting about a data lake is that standardization is really important, because everything is ending up in one place. What's interesting about data mesh is that it's less standard. So I do want to have a center of excellence where I can take learnings around ETL and start to distribute them among lots of teams, but I don't necessarily want to legislate exacting similarities, because that might lead to data quality issues or barriers to entry for getting more nodes into the mesh. It is a delicate balance, because I could definitely fall off the edge on either side: legislate too much, and now it's difficult to get data in; or don't standardize at all, and now every team needs to reinvent the wheel. But let's find that careful balance where we can use each other's learnings but not be dependent on each other's assumptions. Not a great answer, but I like that thought. Yes, good question. So I've been very careful to avoid mentioning products today, and can I step over to the other side of that and start talking about what products we might use to get a data mesh? No? Yeah, my goal is really not to pitch today, so I'm going to avoid mentioning products. I love that thought, though, and so, you know, find me over a beer and let's definitely get really intense. Yes, great question. Thank you. Yes, good question. So I've got maybe some non-transactional data sources. Maybe I've got some data streaming. Can I get that into my data mesh as well? Yes, in the way that we get it into the data lake today, we can get it into the data mesh. So maybe streaming is the way to avoid that, you know, kind of transactional nature of an ETL process.
Can we just pipe the transaction log from this through a stream and get it to pipe into our analytical system? Yes, it is fun. In the same way that you've done it in your data warehouse and data lake, you can do that in your data mesh as well. Well, this has been a lot of fun. Thanks for joining us today. Hello. One, two, three. One, two, three. I think we can start. Hello. I am Antoni Ivanov. I am a software engineer at VMware and a technical leader for a new open-source project in the data space called the Versatile Data Kit. A little bit more about myself: I live in Bulgaria. I am a really huge believer in data-driven decisions and in using data in our everyday lives and in our professional lives. Have you watched Moneyball? How many of you have read or watched Moneyball? Yeah, it's a great movie for those who have not watched it. It's about the Oakland Athletics baseball team and their general manager, Billy Beane, who is played by Brad Pitt in the movie. The team has lost their stars and they have a very low budget, so they use an analytical, evidence-based, data-driven approach to assemble a very competitive baseball team and succeed. So, in my opinion, really good decisions, correct decisions, are often the ones where we use data — data-driven decisions — and they usually beat purely intuition-based decisions. There are still a lot of challenges in making efficient use of data, and I am going to focus on those that are very similar to the ones that have been solved in software engineering. I have been working at VMware for the past eight years on the data analytics platform, and in that capacity I've been able to work both as a data analyst creating reports and as an infrastructure operator operating clusters in Kubernetes, Kafka, and these kinds of different roles. The last two, three years, our focus has been: how do we make data engineering efficient? How do we make it easy for data analysts and analytics engineers to focus on SQL, modeling, building facts and dimensions, creating reports and ML models, and abstract away everything else as much as possible? A quick sneak peek: basically we want to go from a model where we have fragmented infrastructure, silos, and a lot of tension, especially between ops and data teams, to a model where everybody focuses on the work that they're really good at and that they want to do. So, the agenda: we will first go over the challenges that we want to solve — why do we need DevOps for data. Next, we will focus on DevOps for data as a service, basically what the solution is, and then I will present three demos. It's quite possible we won't have time for all three demos, so I will let you vote on which of two of the demos you want me to show. So, let's look at the challenges. Let's introduce the two actors in our play. On one side, we will talk about infrastructure and operations teams, and on the other side we will talk about data teams. Those are the two personas we want to focus on, and because definitions of personas can mean different things, let's define them. The operations teams are generally the people who provision the infrastructure. Those are the people who know about provisioning Kubernetes and how to secure it, the people who provision Spark clusters, Kafka clusters, Impala or Presto, or cloud sources like AWS Redshift and Kinesis. They understand the operation of data infrastructure: they know that with HDFS it's better to use bigger files, not small files, while Kafka expects small messages.
And on the other side, they also understand software development practices to some depth: how to build continuous integration and continuous deployment, how to make sure that code is versioned, traceable, and so on. Overall, their goal is to maintain the infrastructure and make sure things work. In other words, they optimize for stability and reliability. On the other hand, we have what I call data teams, and I'm using the term very broadly. It could mean data scientists, data engineers, analytics engineers, ML engineers — whichever terms are used — but they have the domain knowledge and the business knowledge to create the correct models for their business use case. They know the column names of the tables, they know how to best join them, they can create ML models to build recommendation systems, and so on. Their focus usually is on making sure they can iterate very quickly in order to deliver to the business. Nowadays businesses move very quickly, and in order to make good decisions, you need to keep up with the data at the same speed. So, for example, if we want to see whether a new feature is adopted, we want to add a new column for this new feature to the table in a matter of hours, not weeks or months. But those two priorities are somewhat in conflict with each other. This is very similar to development versus operations in the DevOps world: the data team wants to optimize for quickly delivering value from the data, but on the other hand, the infrastructure team wants to make sure that things keep working. I've seen a lot of situations where a data team would create problems simply because they would use a thousand INSERT VALUES statements, which causes a huge load on the database, instead of batching them. They would, for example, store huge values in a single column. The other problem is that there is often no clear separation between what each team needs to do. There are times when the data team needs to provision their own infrastructure, so they need to go and start an EC2 instance, create a cron job, and all kinds of things. At the same time, we've had an operations team that has to go and debug the data application — the job that's breaking the production report — because, you know, the VP is shouting at them and they need to make sure it works. This kind of thing overall causes both teams to have much worse operational and development efficiency. There's a very similar problem in the DevOps world, so we need to see how we can adapt and adopt those practices, accounting for some differences between DevOps in software engineering and DevOps for data engineering. What if we could provide a solution where multiple teams can work with data in a way that's easy for them to consume, and at the same time infrastructure operators can efficiently operate it, and alleviate those problems? What does this mean? For data teams, we need to enable easy workload creation, with self-service capabilities, so data teams can create an end-to-end pipeline and own it completely on their own. At the same time, we want to abstract and manage the infrastructure for them as much as possible, so when they go and create a job, or deploy a job, or just execute a query, they don't need to consider whether, underneath, the files are in Parquet, whether the code is versioned, and so on.
And overall, things should be fully automated: the DevOps cycle and the data journey are automated for them, so they can focus on business logic. At the same time, we need to enable IT to establish policies — maybe governance policies, maybe just best practices — and make sure those policies are observed. For example, can we ensure the data is anonymized? Can we ensure that files deployed to the production environment are not executable? And next, we need to ensure that the infrastructure is controlled; that's probably the flip side of managed infrastructure. For example, the IT team knows best that in order to ingest data into Kafka you need to keep all messages very small. So if we can provide some mechanism for the IT team, the infrastructure operators team, to encode these kinds of rules, then when the data engineering team writes files, they keep using the same ingestion interface, and behind the scenes those files are chunked automatically as part of the user code. If we can provide this kind of plugin extensibility mechanism, the IT team can provide these kinds of features for their own data engineering organization, and that alleviates a lot of those problems: the infrastructure team, which knows the best way to ingest data and the best way to query data, can encode those rules, and the data team just focuses on sending data for ingestion or querying data. And that's what the Versatile Data Kit project aims to do: solve those kinds of needs for the data engineers and for the infrastructure people and operators. The Versatile Data Kit contains two things, two components. One is the Versatile Data Kit SDK. It's a Python-based application which lets data engineers write any SQL or Python, and gives them the interfaces to help them ingest data and transform data. It has some other goodies as well, and importantly it has plugin-based extensibility, so it enables exactly this: it basically sits in the data path and lets you establish those kinds of best practices. The other part is the Versatile Data Kit Control Service. It's a runtime for data jobs. It provides a REST API which abstracts a lot of the data job lifecycle, for example with automatic versioning and deployment. It has monitoring and alerting capabilities, a lot of basic things. It's cloud native, so it's integrated with Kubernetes and you can have monitoring in Prometheus, and it's aimed primarily at the infrastructure and operations team. At a high level, we need to do two things. One, we need to automate and abstract the development process — this is how we give power to operators to establish best practices. With the Versatile Data Kit, if you flatten out the DevOps cycle, we can automate and abstract large parts of it. For example, we let data teams only plan and code: they write their code in some folder, and then we take care of all the necessary containers, install dependencies from a requirements file, run some tests, version it, release it, and deploy it. And all of this happens behind one single command, which we call vdk deploy. At the same time, we enable a separate team to establish policies and make extensions by plugging into the DevOps cycle, so they can control how things are built and released. A very quick example: let's say that we want to create a plugin which will run some standard system tests for all data jobs in the enterprise.
Because the Versatile Data Kit manages the deployment, every data job that's managed by the Versatile Data Kit can run those checks before it's deployed. For example, let's say that we want to run some standard tests and at the same time we want to do some security hardening. What we do is simply run some system tests that a central team has come up with, and we remove execution privileges for files during the build, so that when the jobs are deployed to run in production they will not have any execution privileges; they won't be able to be executed. This is an example taken from some real use cases we had where we needed to do some security hardening, and it's packaged as a Docker image which can be configured during installation. The second important piece that we need to do is to automate and abstract the data journey. If you look at the data journey — you know, you ingest data from data sources, you transform it, and then you actually use it to make insights — we should be able to make sure that when accessing the database there are some best practices that can be enforced. VDK can act almost as a virtualization layer: data teams use the same interfaces, for example the DB API for SQL connections in Python, and in this virtualization layer we can encode different rules using plugins, which we'll show in a couple of demos so that it's more understandable. So if we circle back, we can say that the SDK is primarily used by data teams — they use it to develop data jobs, and for them it abstracts and automates the data journey — while the Control Service is primarily for operators, to automate and abstract the DevOps cycle. So that it's a bit more clear, maybe we can show a quick two or three demos. The first use case we'd like to tackle is explaining how we install and deploy the SDK, and what we mean by managed infrastructure. So what we're going to do: let's say we have multiple tools and multiple databases. We're going to build a custom SDK, which is data-engineer facing, we're going to deploy it and have them use it, and at the same time we have a runtime environment to which they can deploy their data jobs. So first we'll build our custom SDK, then we'll install our Versatile Data Kit Control Service using that custom SDK, and then we'll see how this is used by data teams. We'll start from scratch. The SDK is simply a Python application in our case; if you use Python, you'll find this very trivial. I'm using setuptools in this case, but we could use any kind of distribution packaging library. We can simply specify all the dependencies — not sure why it's so blurred, but we specify here vdk-postgres and vdk-snowflake. So let's say that our organization has dependencies on Postgres and Snowflake and we want to ingest data into them. After that we can specify a configuration plugin, so that the data engineers who are using the SDK don't need to worry about the best way to configure the connection to the Postgres database: we can simply specify all the necessary configuration, and I guess we'll just leave username and password for them. In a way it's like Lego: we simply add a new Lego piece to our dependencies, and that's the custom SDK that we want to build, which we've named my-org-vdk. Now this custom SDK is ready, and it's a Python application, so we can give it to data engineers and data analysts.
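A minimal sketch of what that setup.py might look like. vdk-postgres and vdk-snowflake are the plugins named in the talk, vdk-core is the base SDK package, the my-org-vdk name follows the demo, and the config-plugin dependency is a hypothetical stand-in for the configuration piece described above:

```python
# setup.py -- a thin "meta" distribution that pins the plugins the organization
# wants every data engineer to get when they install the custom SDK.
from setuptools import setup, find_packages

setup(
    name="my-org-vdk",
    version="1.0.0",
    description="Our organization's pre-configured Versatile Data Kit SDK",
    packages=find_packages(),
    install_requires=[
        "vdk-core",           # the base SDK
        "vdk-postgres",       # database plugins the organization depends on
        "vdk-snowflake",
        "my-org-vdk-config",  # hypothetical plugin carrying connection defaults
    ],
)
```

Data engineers would then just pip install my-org-vdk and everyone gets the same plugins and the same connection defaults.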
Now this custom SDK is ready, and since it's a Python application we can give it to data engineers and data analysts. They can use it locally to develop data jobs, but eventually they'll want to deploy those data jobs to run in a managed environment, which is what the Versatile Data Kit Control Service provides. So let's see how we can install it. The first thing we need to do with our myOrgVDK is create a Docker image, because that's how it's configured: the Control Service needs this SDK to run the data jobs with. We'll probably tag it with a version and release it. Next we use a Helm chart; the Control Service is installed with a Helm chart, and in this case we simply specify the Docker image we have just created, myOrgVDK, with the tag "release". We use a stable tag, similar to how you'd use the tag "latest". This allows us to upgrade automatically: even if we have a million data jobs running, when we change the SDK the next run will automatically use the new SDK. That's because the user code and the VDK code coexist, so we have two independently versioned pieces, our system library and the user code. Then we simply run the helm install command to deploy our Control Service. It exposes a REST API which can be used to deploy jobs, and we'll show what we can do right now with that REST API. So let's run one data job: we'll show an example of a data engineer creating an ingestion data job. What they will do is get some data from a REST API, which looks like this, and write some user code for VDK to execute, which automatically creates the tables and populates the correct columns; that's a convenience feature the Versatile Data Kit provides for data engineers. First step, we create our data job. We name it "example" and give it a team name, which allows us to group data jobs; we can also have some access control, where you have to be a member of a team to have access to the data job. After that, let's say we develop the data job, in this case an ingestion data job. The entry point for every Python step in a data job is a run method that takes a job input, which is basically the interface you can use: it has methods to ingest data and to query data (we could do a separate talk about this). In this case we ingest data into the target table, and we ingest it from a JSON REST API that looks like this; sorry about the blurry screen. Let's say we run our data job, and after it succeeds it has executed the ingestion and populated the table. We can see now that the row is in the table; if somebody knows what that text says, they can translate it. As the next step we'll deploy this data job. As far as data engineering is concerned, they just run the vdk deploy command in the job directory and it takes care of everything. They don't know that there are containers behind it, they don't know that we store the job in some kind of version control, that we version it and release it; it all happens behind the scenes, hidden behind a single vdk deploy command.
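For reference, a minimal ingestion step along the lines of what the demo describes might look roughly like this. The URL and table name are hypothetical, and the job-input ingestion call reflects my reading of the VDK job-input interface, so treat the exact parameter names as assumptions and check the VDK docs.

```python
# 10_ingest.py: a rough sketch of an ingestion step in a VDK data job.
# The API URL and destination table are hypothetical; the send_object_for_ingestion
# call is written from memory of the VDK interface, so verify it against the docs.
import requests


def run(job_input):
    # Every Python step in a VDK data job exposes a run(job_input) entry point.
    response = requests.get("https://example.com/api/data")  # hypothetical REST API
    response.raise_for_status()

    for record in response.json():
        # Hand each record to VDK, which creates/populates the destination table.
        job_input.send_object_for_ingestion(
            payload=record,
            destination_table="rest_target_table",  # hypothetical table name
        )
```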
And with that, we already have an analytics platform: a platform for data engineers that they can start using immediately. They can onboard their existing data jobs fairly easily, or they can start writing new data jobs using the custom SDK. We configured the databases, so it's immediately connected to the existing data infrastructure; we showed this with the database example, and we showed how things can be customized, in this case by a central team, and how we can create configuration plugins so that the configuration is readily available when people use the custom SDK. I'm not sure we have time for all three demos, so I'll let you choose which one to start with. The first one shows how you can improve data infrastructure security by automatically anonymizing sensitive data: we'll show how, by installing a plugin and some configuration, the same data we saw before gets some of its columns anonymized without any changes to the existing data engineering code. The second demo shows how we can improve infrastructure stability by validating big SQL queries; we'll write the plugin itself, because it's simple enough, and validate a use case around big SQL queries. Which one do we want to see first? Raise hands for those who want to see use case one first. Okay. And can you raise hands if you want to see use case two first? Okay, there are more people for use case one, so let's start with sensitive data and improving data infrastructure security. This is an example of how we can automatically instrument the data journey and hide the complexity of the data infrastructure from data teams. What we are going to do is take the same job we showed in the first demo, install an anonymization plugin, and it will automatically, based on configuration, anonymize the title field. It's a POC plugin that I developed for the demo; I'm not going to show how it's developed right now, we'll just install it. First we show that the data is not anonymized: the title is right there in plain text. Does anybody know what it says? I should have checked; I'm curious what it means. Now we will install the anonymization plugin. Installing plugins is simply a pip install command, so we install vdk-poc-anonymize, and now this plugin is available in our installation. But I don't want to just install it locally; we want to make sure that every single installation using the custom SDK is using it, so we just add this new piece to our myOrgVDK dependencies: vdk-poc-anonymize. After this is pushed to some PyPI repository, all local installations will be prompted to upgrade, and after we rebuild the Docker image every single job running in the Control Service should start using the new plugin as well. So let's configure it. To configure it, we first specify that we want to use a preprocessing sequence called "anonymize" (that's how we named our plugin), and next we specify which fields we want to anonymize. Again, we do this in the configuration plugin we built in the first step, so once this is patched into the myOrgVDK custom SDK it will be used by every single deployed data job, and when the data engineers upgrade the SDK locally it will be used locally too. In this case (it's probably hard to see) we specify a dictionary where the target table is the key and the value is a list of fields, in this case one field called title, which will be anonymized. Afterwards we'll show how this is used by a data team. So let's run our data job, the same data job we ran before, now after installing the plugin, and we can see that when we query the table the field is anonymized. You can play with the demo yourself; if you contact me I can send you a link where you can try it on your own. That was our second demo: it showed how, by simply installing a plugin and some configuration, without changing the existing data job (we can see it's the same one), our data starts getting anonymized, making sure that a centralized team that wants control over these kinds of decisions can make them in a central place.
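A sketch of what that central configuration could look like as a VDK configuration-plugin hook is below. The option names (the preprocessing sequence and the table-to-fields mapping) follow the demo description, but the hook import path and the builder methods are written from my understanding of the VDK plugin framework, so treat them as assumptions.

```python
# myorg_vdk_config/config_plugin.py: sketch of a configuration plugin that turns
# on anonymization for every data job built with the custom SDK. Hook and option
# names follow the talk; verify exact names against the VDK documentation.
from vdk.api.plugin.hook_markers import hookimpl


@hookimpl
def vdk_configure(config_builder):
    # Run the "anonymize" plugin as a preprocessing step for ingested payloads.
    config_builder.add(
        key="ingest_payload_preprocess_sequence",
        default_value="anonymize",
        description="Preprocessing plugins applied to every ingested payload.",
    )
    # Which fields to anonymize, per destination table (from the demo: the title
    # column of the ingestion target table). A real plugin might need to serialize
    # this value, e.g. as JSON, depending on what VDK accepts for configuration.
    config_builder.add(
        key="anonymization_fields",
        default_value={"rest_target_table": ["title"]},
        description="Mapping of destination table to columns to anonymize.",
    )
```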
So yeah, we showed how we can enforce these kinds of data security policies, because they're applicable to all data jobs. You can do anonymization because you can basically intercept what happens to the data when it's ingested, so you can do all kinds of things: you can start sending telemetry about the size of the data, you can simply reject it if you don't want it, and so on. And the important part is that there are really no changes to the data engineering code. I guess we have time for the second demo, so I'll do it as well. Again, we'll show a similar concept of how we can automate and abstract the data journey. Have any of you been in situations where you operated databases like Postgres, Impala, Hive, or Snowflake? Do you have experience with a situation where some user is doing something and the database gets extremely slow or even crashes? We've had a lot of cases where we had to deal with these kinds of things. From time to time users decided to ingest data using INSERT VALUES statements, and those statements, if you want to ingest a million rows, can be megabytes to gigabytes. The way these databases work, you usually have master nodes and worker nodes; they may have terabytes of memory or less, but they are not optimized for parsing a one-gigabyte query, so they tend to crash. Initially the only thing we could do was ping the user and ask them to fix their query, but if you can do this much earlier, even during the development cycle, that's much, much better. So what we did next was build a VDK query validation plugin, so that we can intercept a query and decide whether it's too big or not. Let's build our plugin: first we build the query plugin, then we register it as a plugin in the Control Service and in our custom SDK so it's applicable to all data jobs, and finally we'll see how it's used by data teams. First we build the plugin. It's an SDK plugin, and it's a Python application, so all we need to build is a simple Python application. VDK has multiple plugin hooks; they are simple methods based on pluggy, so if any of you are familiar with pytest it's very similar, it's the same framework pytest uses. What we first do is implement the VDK configure hook and say we want a configuration option called max query length. This value will be automatically injected from environment variables, a configuration file, or whatever, depending on how the environment is configured; that's why I do it this way. In the next step we use the already populated value, and finally we implement our database connection validate hook, the one that actually does the check. It's a special hook: there are different DB connection hooks, but this one is called before an operation is executed, and an operation is usually a query. What we do is simply calculate the size of the query, which is pretty simple, the length of the operation plus the length of the parameters. If it's bigger than the configured value, we say no and reject it. The next step is to register it as a plugin. The only extra thing necessary on top of what you do in a normal Python application is defining this entry point called vdk.plugin.run, in which you specify the name of your plugin, vdk-my-validation, and the package name, which is the entry point for the plugin.
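Here is a rough sketch of both pieces: the plugin module with the two hooks, and the entry-point registration. The hook names and signatures are written from the talk's description and my understanding of the pluggy-based VDK plugin framework, and the package and plugin names are hypothetical, so check the VDK documentation before relying on any of it.

```python
# vdk_my_validation/plugin.py: sketch of the query-size validation plugin.
# Hook names and signatures follow the talk's description; verify against VDK docs.
from vdk.api.plugin.hook_markers import hookimpl

MAX_QUERY_LENGTH = 100_000  # fallback; the real plugin reads the configured value


@hookimpl
def vdk_configure(config_builder):
    # Declare the option; its value can then come from environment variables or
    # a configuration file, depending on how the environment is set up.
    config_builder.add(
        key="max_query_length",
        default_value=MAX_QUERY_LENGTH,
        description="Queries longer than this (in characters) are rejected.",
    )


@hookimpl
def db_connection_validate_operation(operation, parameters):
    # Called before any database operation (usually a query) is executed.
    size = len(operation) + sum(len(str(p)) for p in (parameters or []))
    if size > MAX_QUERY_LENGTH:
        raise ValueError(f"Query of size {size} exceeds the configured maximum.")
```

And the registration piece, which is the only thing VDK needs beyond a normal Python package:

```python
# setup.py for the plugin package: the extra piece compared to a normal Python
# package is the vdk.plugin.run entry point, which is how VDK discovers plugins.
import setuptools

setuptools.setup(
    name="vdk-my-validation",
    version="0.1.0",
    install_requires=["vdk-core"],
    entry_points={
        "vdk.plugin.run": ["vdk-my-validation = vdk_my_validation.plugin"],
    },
)
```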
After that, once we've released it to the PyPI repository, we need to edit our myOrgVDK, the custom SDK used by the data engineers and data analysts in our organization, to apply this plugin, so we add vdk-my-validation there, and after we deploy this it will be used by everybody. And we can show it now. In our example I've pre-configured the max query length to be 10 characters, and we can see that after the data job executed we had a query which exceeded those 10 characters, so it failed. The data engineers get the feedback that the query is too long before it reaches production and crashes the database. So in summary, we created this query validation plugin which intercepted queries and validated that they meet our maximum query length. In a way we simplified our operations by providing a better service for our users while at the same time ensuring stability, and this reduced dependencies between different teams. As an overall summary, what we want to do is DevOps for data services: we want to enable everyone to focus on the work that really requires their core skills. That could be different in different companies, so we need to allow them to decide for themselves what their core skills are. In this way we can allow infrastructure people to have control over how the infrastructure is used, we can allow DevOps people to implement the best DevOps and CI/CD practices, and we can allow data engineers and data analysts to focus on creating reports, business logic, and transformations. There are a lot more things; this is a new open source project and the documentation is pretty poor, so I really suggest you contact us if you find anything interesting. You can find our contacts on the Versatile Data Kit contacts page, and there is also a Medium blog with some articles which you might find interesting. Cool, so that was my presentation. I would really appreciate it if you point your phones at this feedback form and complete it; it asks for your feedback about the project and about the presentation. And if you want to get involved, feel free to reach out to us: you can go to the GitHub page, start a discussion, open an issue, or reach us on our chat channel; we are very open. Any questions? I can give a free t-shirt for any question asked. Yeah, how does it compare with Airflow, Flyte, and similar tools? In a way, Airflow and Flyte and Prefect are more like process orchestration; with Airflow you have an operator, and Airflow itself does not know what happens inside the operator. VDK is more of a data orchestration tool, because it's more like a virtualization layer where you use the normal DB interface, the DB API (PEP 249), for connecting, but we can manage it: we have that virtualization layer before the query goes to the database, so we can add all the kinds of things that we showed. For example, with Airflow there is no way to write an anonymization plugin that will be applicable to all jobs running in Airflow; the data engineers would need to use it explicitly. The idea here is that you abstract this away from data engineers, and they don't even need to know that this plugin exists. Here's the t-shirt, or you can take it later. Any other questions? You can schedule jobs with a cron expression, but we don't actually do workflow orchestration. For the workflow itself we have an integration with Airflow: you can use VDK data jobs standalone, but eventually, if you hit enough complexity, you'll want to manage your workflows, have them depend on each other, and define those dependencies.
You can install Airflow and the Airflow provider, and then, without changing your existing data jobs, you automatically start scheduling them in a workflow. Yeah, how about you reach out to me afterwards; it's on our GitHub page, but I don't have internet here. Okay, there is a community meeting where we presented this integration (it's very new, a month or two old), but if you reach out to me I'll just send you the exact link so you don't have to search for it. Any other questions? Okay, thank you very much for attending the presentation. Okay, a quick test, let me start. Is this clear in the back? I can't imagine not. Alright, so I'm going to give a talk about JSON-LD for Linked Open Data and knowledge graphs. First a little bit about myself: I'm a cognitive scientist, I've been working with Lisp and Prolog for about 40 years, and currently I'm the CEO of a company in the Bay Area called Franz Inc. It's a Lisp compiler company, but for the last 12 or 13 years I've been totally focused on a graph database called AllegroGraph. It's a graph database like most others, except ours is a semantic graph database. So I'm first going to talk about the world of semantics, RDF, and graphs. Who here works with RDF? Just this one, right. Who here works with JSON-LD? No one, okay, great, at least a fresh audience. So I'm going to talk about semantics, then I'll talk about the world of JSON and documents, then I'll talk about the intersection of JSON and semantics, called JSON-LD, and then I'll talk a little bit about whether you want to store these documents in a document store or in a semantic graph database, and what the advantages and disadvantages are. But a lot of what I'll talk about is mostly open data and open source data and what you can do with it. So first let's talk about the world of semantics and RDF. You probably know Tim Berners-Lee, who invented HTTP and HTML, and after a while he kind of complained: okay, now we have this wonderful internet (please remember this is 1994), but only humans can read the content on the web (sorry, the slide says context, I mean content), and computers can't. So what we really need is a metadata language that gives meaning to the objects on the web. For example, for your resume on the web, the computer understands the structure of the resume, or you have metadata for a recipe and it will understand the recipe. So he had this idea that almost every type of object on the web should be readable by computers, and he thought about how to do that and came up with this language called RDF. Now, RDF is a very, very simple language, and it amazes me every day that people complain about how complicated it is. At the end of the talk I will ask you whether you think it is very complicated, or maybe you never thought it was complicated, or maybe you never even thought about it. So what is RDF? The core is this idea that every object in the universe needs to have a unique identifier, and the identifier should be an IRI, because if you don't give the same name to the same thing, it is really hard to build large integrated enterprise IT systems, or almost anything. So the first principle of RDF is: give every unique thing a unique name. The second thing is that you can describe almost anything in the world as triples, and I will give some examples later. You can say "Jans is a man", a very simple triple, and you can literally describe almost anything you can think of using these simple triples.
And then finally the third part of the semantics is that data should be self-describing: for every IRI in your triples, you should be able to go to a place that actually explains what the thing is, what the type is, or, if it is an attribute of a thing, what that attribute means in the particular context of that type. We call these things schemas, or we call them ontologies. Anyone here ever worked with ontologies? Okay, one, and you should be. So now let's look at some triples. I want to describe that there's a person born in 1958 with the first name Jans and the last name Aasman, who lives in a place with the name Moraga that is part of the state California, and the population of these places. The way computers really like to get these triples is in a format called N-Triples. So you have a person one, with a unique IRI, and person one has a first name, Jans; so now there's one triple. The subject of the triple is always an IRI (although it can also be what we call a blank node), the predicate is always an IRI, and then the third part can be any XSD data type that you want. Now, for a human being, looking at these triples is completely overwhelming, so we have another format that's much easier to read. Here we have the same information that we just saw, and I can literally show this to any person on the street and ask, what does that mean, and the person will instantly tell you that there's a person, that there's a thing which is a person; anyway, you get my point. The only thing that might be a little bit surprising is the "a" here: because we use rdf:type so often to give a type to a particular instance of a thing, we came up with the shorthand called "a", so now we can say person one is a person.
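To make that concrete, here is a tiny sketch of triples like the ones on the slides, written in Turtle-style syntax and loaded with the rdflib Python library; the ex: namespace and the exact property names are invented for the example, not taken from the slides.

```python
# A minimal sketch of person/place triples, loaded with rdflib. The ex: IRIs and
# property names are made up for illustration.
from rdflib import Graph

turtle = """
@prefix ex: <http://example.org/> .

ex:person1 a ex:Person ;               # "a" is the shorthand for rdf:type
    ex:firstName "Jans" ;
    ex:lastName  "Aasman" ;
    ex:birthYear 1958 ;
    ex:livesIn   ex:place1 .

ex:place1 a ex:Place ;
    ex:name   "Moraga" ;
    ex:partOf ex:California .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# Print the same statements in the machine-friendly N-Triples form.
print(g.serialize(format="nt"))
```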
Alright, so now we have this, and then I can make it a little bit more complicated. I could say there's a person that lives in two places, is married to Sofia, pays taxes in California in the USA, and they have a son in Stockton, New Jersey. Again, if I give you a minute you would probably be able to figure that out, but what you probably don't realize is that there's already a fairly complex graph in there. I have no idea if this is readable from that distance, but this is a tool called Gruff that we have, a visual interface to the RDF graph database that we sell. It allows you to look at graphs in all kinds of different ways: here you see the names of the predicates and the types of the objects in the graph, but this is hard to read, so you can render it in many different ways, one of them being what we call the tree view, where you try to use the shortest distance to lay things out on the screen. Anyway, here's a graph, and now I want to do a query. RDF also has a query language, called SPARQL, and it looks very much like SQL; you'll see some real SPARQL code in a second. But what we have is a visual query language where you can say: there's a person, first name, last name (everything with a question mark is a variable), and this person pays taxes in California; it basically expresses what you see here. If I push this button, this is the underlying query, which I didn't have to write code for. Here you see there's a ?place1, so that's a variable; it's part of a state; there's a ?place2, part of the same state; person one pays taxes to state one and lives in these two different places; and there's the last name and the first name. You all know your SQL, right? Anyone who would have trouble reading a query like this? It would probably take a little bit, but I give a lot of tutorials, and once I get SQL people there, it's very easy to explain; it takes a little click in your head and then you've got it. Okay, so this is the query language for RDF.
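For readers following along, here is a rough sketch of that kind of SPARQL query executed with rdflib, using the same invented ex: namespace as the previous snippet rather than whatever IRIs the real demo uses.

```python
# Sketch of the "who pays taxes in California and where do they live" query,
# run with rdflib over a local file. The file name and ex: properties are invented.
from rdflib import Graph

g = Graph()
g.parse("people.ttl", format="turtle")   # hypothetical file containing the triples

query = """
PREFIX ex: <http://example.org/>
SELECT ?firstName ?lastName ?placeName
WHERE {
    ?person ex:firstName   ?firstName ;
            ex:lastName    ?lastName ;
            ex:paysTaxesIn ex:California ;
            ex:livesIn     ?place .
    ?place  ex:name        ?placeName ;
            ex:partOf      ex:California .
}
"""

for row in g.query(query):
    print(row.firstName, row.lastName, row.placeName)
```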
And then, as I mentioned, there are ontologies. People also use ontologies to (a) describe an object in great detail and (b) allow for reasoning over your data, over RDF. So this is a tiny one here: this symbol Person is a class, and a Person is actually a subclass of Mammal, and Mammal is also a class of things. I can also describe predicates: I can say hasParent is an object property, and then I can add something to start reasoning with; I can say hasChild is the inverse of hasParent. So in my database I have that Hans has parents Sofia and Jans, and by telling the system that hasChild is the inverse of hasParent, I can start reasoning. First let's do the mammal thing: because I said that a person is a subclass of mammal, I can now ask the database, is Jans a mammal? And if I ask this question I get yes, Jans is a mammal. So that's very simple reasoning. Or: does Jans have children? The system can say that person one has child person two, even though in the database there is literally no triple that says Jans has child Hans; because of the ontology definition, the system automatically infers that I have children. So this is the tiniest, smallest RDF tutorial that I can give, but I hope you get it; it's all fairly common sense. Now, if you want to read about it: the biggest set of standards at the W3C is about semantics. There are so many different standards: there's the RDF standard for the triples, there's OWL, which does the definition of things and reasoning, there's the SPARQL query language, there's RDFa, which is a micro-format for objects, there's JSON-LD, which I'm going to talk about later and which is just a JSON serialization of triples, and SKOS is for taxonomies. Anyone here working on taxonomies? Okay, so you know what a taxonomy is: it's a tree of words that describes the important words in a particular domain; I'll get to that in a minute. Anyway, many, many more standards, but the W3C has been working for 20 years now on the standards for RDF and every derived technology. So now, where do people use RDF? The first place: nearly every Fortune 500 company is now building SKOS taxonomies, and I'll go into detail a little bit later. It's used for the linked open data that I promised to talk about, both for public data and for enterprise data, and I'll talk about that some more. And then nearly every Fortune 500 company is right now building knowledge graphs. In your companies, are people building knowledge graphs? Not really? Okay, so another fresh area to talk about, good. So let's talk about RDF taxonomies and ontologies. As I just said, all these companies are now building taxonomies, and all of them are RDF based. For example in healthcare, for hospitals, people have built a complete, massive system of the words that are important in the context of a hospital, so every medical term will be in there. But there are always hierarchies between words, so basically you always think of a hierarchy of words in a particular domain. Sorry, I said words, but actually I'm wrong: we talk about concepts, because a concept is also just an IRI, and then you can have a prefLabel like "lung cancer" and all the variations of how people talk about lung cancer. But lung cancer itself is a subtype of cancer in general, and also a subtype of lung diseases, so taxonomies are very complex trees and graphs of how a bunch of words hang together. Maybe you've heard of WordNet, created by Princeton to describe all the words in the English language; it's kind of an overall, general taxonomy of important words. Anyway, people build these taxonomies, and they do it for many different reasons. To improve search over documents: say you have a concept for lung cancer, which is just an IRI; it will say that lung cancer has prefLabel "lung cancer", and it will have alternative terms, so now when I search I can search for any synonym of the words "lung cancer" and a modern search system will still find it, because all the meanings of lung cancer are linked together. You can improve your NLP processes, basically the same thing: some of you must be doing entity extraction from text, where you try to find all the important words in a text, but if there's a big document with five different ways people refer to lung cancer, you really only want the concept; you don't care how the text refers to it, you just want to know what the ultimate concept in the text was. People use it to harmonize documents, and then we use it to build knowledge graphs. Knowledge graphs are basically applications of graph databases where you try to collect everything you want to know about a particular entity you're interested in, and the relationships of that entity to everything else in your company; I'll talk about them a little more later. Now, within companies people make these taxonomies and ontologies, but also in a particular domain or field people need ontologies, like in healthcare, in a bank, or in the pharmaceutical industry. So there are actually big efforts going on cross-enterprise, where people say: building a taxonomy of all the important words is something I don't get any competitive benefit from, so let's all work together so we can share data. There are a lot of efforts going on right now in all kinds of domains to create cross-domain taxonomies. A big one is in life sciences: for example, at Stanford there's this website called BioPortal, and they've collected several thousand life science taxonomies and ontologies, so if you want to know something about a particular concept, lung cancer say, you look it up on this website and you get all the taxonomies where that concept is important, and this is all open source, by the way. And then more than 20 banks have worked together for the last 10 years to come up with a taxonomy and an ontology of all the important words and concepts of banking (this is mandated by the government, by the way), so that it's easier for the government to compare banks on various measures, because if everyone uses different words and concepts it's very hard to compare. I hope that makes sense. Engineers do it, oil and gas have lots of examples.
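As a concrete picture of what such a taxonomy entry looks like, here is a tiny SKOS sketch of the lung-cancer concept described above, again loaded with rdflib; the ex: IRIs and the particular synonyms are invented for illustration.

```python
# Sketch of one SKOS concept: a preferred label, alternative labels (synonyms),
# and broader concepts. Only the skos: namespace is real; the rest is invented.
from rdflib import Graph

skos_turtle = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/health/> .

ex:LungCancer a skos:Concept ;
    skos:prefLabel "lung cancer"@en ;
    skos:altLabel  "pulmonary carcinoma"@en , "carcinoma of the lung"@en ;
    skos:broader   ex:Cancer , ex:LungDisease .
"""

g = Graph()
g.parse(data=skos_turtle, format="turtle")
print(len(g), "triples")   # every label and broader link is just another triple
```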
So that is about taxonomies, and they're very, very important, because without a taxonomy you can't build knowledge graphs or do a lot of other things. The other thing is linked open data, and basically what it is: you take public data and you make it available as these triples, as RDF. Now, Europe started with this maybe 10 years before the US started promoting it, so they're spending a lot of money on it in research projects. For example, if you want to get a research grant in Europe, you now have to take 5-10% of the funds you get so that after the project is done you describe all the data sets and all the metadata: you put your data set somewhere and you put the metadata that describes exactly what's in that data set, and you're supposed to use RDF technology for it. Usually people do research, they pick the little thing that gives them a great publication, and after a while all the other data is gone, but there are so many other uses, and people could use the data again for their own experiments to get more insights. Anyway, it is already mandated in the EU that you use RDF to describe the data. And then in 2021 the National Science Foundation and the White House started a big project to look at all the data in the government that is publicly available and do something useful with it, in what they call these knowledge graphs I just talked about, and again 99% of these knowledge graphs will be RDF related. Any questions about this? And then there are also companies that make this available on a commercial basis, for example in enterprises, where you take enterprise data and use the same principles that you use for public data, except within the enterprise. There's a company called data.world that's creating all kinds of technologies to take your data lakes and make all the data available as RDF, but they also have a mission to take all the public information in the world and turn it into public data that you can work with; an interesting company. But all this linked open data stuff started maybe 20 years ago, when people took these public data sets and made them available. Has anyone here heard of DBpedia? There. So DBpedia is basically all the info boxes of Wikipedia pages made available as triples. This is an enormous amount of triples, by now 15 billion, but literally everything in Wikipedia: you can now just do a SPARQL query to find a concept or an article and all its relationships to other articles, and that's why it's in the middle of the diagram, because this is probably the most important one. But then there are some other ones here: Linked Clinical Trials, which is 350,000 clinical trials made available as triples (I'm going to do a little demo and show you something about that), and something like GeoNames, if you've heard of that, which is a database of 7 million places on earth with all kinds of metadata, also available as an RDF triple store. And I have this whole list here; I probably only know half of them, very interesting data sets that you can download from somewhere, play with, and do useful things with. This was the linked open data cloud in 2009, and this is the one in 2022; it's become absolutely, completely, totally unreadable. You can still go to this cloud, but you'd go to the top here and say, okay, I want government (this picture is not interactive, but you would click on government), and then you see a much more reduced cloud that is specific to government. So it's kind of exploded: a number of interesting data sets that you can do something with. It's very active in Europe, but I'm beginning to see the same thing in the United States.
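To make the "you can just do a SPARQL query against DBpedia" point concrete, here is a small sketch using the SPARQLWrapper Python library against the public DBpedia endpoint; endpoint availability and the exact properties you care about will vary, so treat it as illustrative.

```python
# Querying DBpedia's public SPARQL endpoint with SPARQLWrapper: find the birth
# place of one Wikipedia entity. Purely illustrative; the public endpoint may
# throttle requests or change over time.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?birthPlace WHERE { dbr:Barack_Obama dbo:birthPlace ?birthPlace . }
""")
sparql.setReturnFormat(JSON)

for result in sparql.query().convert()["results"]["bindings"]:
    print(result["birthPlace"]["value"])
```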
Okay, so now you have all that linked data; what can you do with it? You have that open data; here's the thing that I usually demo. I'm not going to demo it today, I'm just going to talk about it. For example, you want to know: what was the median income of the area where Barack Obama was born? Well, when you do a SPARQL query for that, you first say, find the birthplace of Barack Obama, and then find the GeoNames ID of that birthplace. So now, with the GeoNames ID of the place where Barack Obama was born, you go to the GeoNames database and find all the GeoNames IDs within 10 miles of that place, using the geospatial capabilities that almost every semantic graph database has. Once you have those GeoNames, you go to the census database, which is also available as a triple file, and you find the median income; there's a big census data set. So you can do one query (I'm not going to explain it in detail today, but this is the query I just described in English, written in SPARQL) and I find the four places within 10 miles of where Barack Obama was born. But just think about it: three groups of people worked on their data sets in completely independent places, but because they really believed in the principle of giving a unique name to unique things, you can pretty much automatically integrate these data sets and do very interesting things with them. I always make a joke about the importance of a unique name for a unique thing; well, let me start the other way around. If you ask a young developer in a company to make a new application that needs a relational database, there is not a single neuron in that person's head that thinks about how their database is going to work with every other database in the company. Doesn't happen, ever. In our community, when we make a data set, the only thing we think about is: my data only becomes more important if I can share it with everyone and link it with everyone; otherwise I'm making a very pathetic little silo thingy. It's a huge, almost cultural difference between the people in the world of regular modeling and the semantic community: in the semantic community, data sharing is the most important thing, because that makes your data way more useful than if it's only for the one application you're trying to build. Okay, so now I've talked about taxonomies and I've talked about linked open data; now let's talk about knowledge graphs. Again, this is a fresh audience with respect to knowledge graphs; you haven't done much with them, I'm guessing. You know what Gartner does: they say what's most important, and basically they do that by looking at what's happening in the world. What they noticed in their hype cycle is: hey, knowledge graphs are important. That happened in 2018, then again in 2019, and actually in the latest version, which I couldn't find, it's already over the top; so many people are doing it that for Gartner it's become boring, you just make knowledge graphs now. So why do people make knowledge graphs? A lot of companies are doing it. This is from the last conference before we had the lockdown; I was there, and I got this overview of the little conference that we went to, and what you see here is all the big companies literally building their knowledge graphs: there's Goldman Sachs, Capital One, Wells Fargo, there's Google, there's Xenica, Amazon, Accenture; literally everyone is building knowledge graphs.
With knowledge graphs, again, it's this thing where you have entities that you're interested in, you gather all the knowledge you can find about those entities, you add it to the entity, and then of course you link it to other entities. Here are some commercial slides that I'd like to show. People almost always build knowledge graphs for one reason, and that is to get rid of silos. In every enterprise (some of you work in big enterprises) you literally have tens of thousands of databases, and none of them can talk to each other unless you spend a lot of money to make them talk to each other. And then data science: everyone wants to do data science nowadays, but data scientists spend 90% of their time on data engineering, getting the data clean and preparing it, and most solutions (most ETL solutions, data lake things, master data management) only make it worse. So that's why people believe that knowledge graphs are a solution for removing silos. But knowledge graphs can also make artificial intelligence way more intelligent. Some of you are data scientists, right? You get your tables, you clean up your data, you do some interesting data science work on the tables you got, and now you learn something. But what do you do with the stuff you learned? You put it in a report, you send it in an email and say, oh great, I found a new prediction. Maybe you put it in a pipeline, but it's only a very isolated thing. With knowledge graphs, what you do is, when you learn something, say, oh, I learned something about this patient, you put what you learned back into the knowledge graph, and now you enrich your knowledge graph about patients with the things you learned in your machine learning. Does that make sense? By taking the output of your machine learning and putting it back in the system it came from, you create learning systems. So we're working on that, and the market is growing fast, at least in the predictions; I hope it's good for us that it's growing so fast. But even more important, when a lot of people talk about knowledge graphs, there's a new word that comes up, called data fabric. Are you familiar with data fabrics? Basically, think of it as a more mature version of your IT infrastructure (how do I say that) where you make sure that all the data that is all over the place is cataloged. To begin with, you catalog every database that you have, all the tables and columns, but also the metadata: what applications are using it, from what applications it's getting data, who are the people maintaining it, what business line, what is the purpose of this, and then how does it link to my enterprise ontology of all the important concepts in my enterprise. So you get a maturity layer on top of what you already have; it doesn't change your IT infrastructure, but it adds a bit of knowledge and intelligence to what you have. That's what they call data fabrics, and knowledge graphs are an important component. And let me just go here: this is a slide from Gartner, knowledge graphs are the key to data fabrics, and basically knowledge graphs can bridge all data and metadata silos for seamless data integration and management, and they make your systems way more intelligent by integrating rule-based reasoning and machine learning in one data infrastructure. Okay, some commercial slides; now let me go back.
So, an example of a knowledge graph, and I can't go into too much detail; how much time do I have? Oh, doing well. One example of a knowledge graph is something we built for a long COVID researcher. You know long COVID is a really, really big problem, and people are starting to research it, but in order to study it you need to look at many different information types. If you want to say something about the phenomenon of long COVID, you really want to look at all the clinical trials that have been done. So we took the open source clinical trial data, actually from the FDA (you have to sign some kind of agreement, but literally everyone can get that data; they just want to know who's using it) and you turn it into triples; there are all kinds of variations you can download, or you can make it yourself. Then there is PubMed: every day there are 2,000 new medical articles, which no human being can ever read; 200 years ago you probably still had a chance to read enough of it, but now it's impossible. So we have PubMed, the abstracts and the articles themselves, and some people made a selection of everything COVID related; they call it CORD-19, and it's a massive collection of documents that have something to do with COVID. Then we have VAERS, which is the adverse vaccine reaction data for all the different COVID shots you could get. And then we have medical records, but for our demo purposes we're using an open source system from MITRE called Synthea, where you can generate patient data that's actually based on a lot of machine learning on real people. Of course, for the real work that we do we have actual patient data, but for the open source knowledge graph that we make available we use synthetic data. Anyway, you have all these data sets, but they all use slightly different words. So we also have a thing called UMLS, which is published by the National Library of Medicine and NIH, and it's a massive collection of life science taxonomies; again, many different research groups have turned that data into an RDF taxonomy, so for us that's the thing in the middle. Then we use natural language processing, basically just concept matching, to take the important words in clinical trials and harmonize them; we do analytics on all the text in PubMed and take the important concepts out; we harmonize the side effects in VAERS; and then we have the electronic medical records. Now suddenly we can do research that spans all these different data sets, all open source except the medical records. And so here (I apologize, it's probably too small; I came in this morning, looked at the room, and saw that this was way too small, so I blew up every slide, except I didn't do this one) let me just explain what you see here. This is a picture of an object from each of these four data sets. We have a medical record: there's a person, Rody, from Lewiston, Maine, a non-Hispanic female, who is on some asthma self-management program, and she had an encounter that was paid for by Blue Cross, and she has acute pharyngitis. So that is a concept, a name, which is then linked to the ICD-10 name, which is the billing code that hospitals send around between hospitals and insurance companies. Anyway, this thing is then related to the UMLS I just talked about: this looks like the same name, except this one is in SNOMED, in the UMLS Metathesaurus taxonomy system.
And this thing is a more specialized version of an acute upper respiratory infection, which has another manifestation, which is the common cold. Anyway, what you see here are some terms from the UMLS, and then what you see here are the clinical trials: here you see two clinical trials that talk about pain and sore throat and pharyngitis, and you see how that links to the UMLS you saw here. Then you see two papers from the CORD-19 corpus that link information from these things to hypersensitivity and the sore throat, and then you see VAERS here. Now, this goes way too fast, but for researchers this is fantastic: they can now take information from all these different data sources and use one simple query language to start doing their analytics. Any questions about this? Does it intuitively make sense what I'm showing you here? Alright, that's good enough. So now on to JSON-LD. Knowledge graphs are getting popular very fast, but developers are really a little bit scared to learn the W3C semantics stack; they think it's hard, and a lot of people say, oh no, I'm not going to learn all that, I just want to stay with JSON. So how do you make it as easy as working with MongoDB to add data to a knowledge graph, to retrieve data from a knowledge graph, and to validate the data in a knowledge graph? Well, basically you give them what they want: just JSON, but with a little bit added, and we call it JSON-LD. Everyone here works with JSON. JSON-LD is 100% the same as JSON; it's 100% syntactically JSON, except you do two things, and let me get to those a little later. You all know about JSON: the lingua franca for messaging and data interchange, used for configurations, used in document and key-value stores. JSON has a very simple standard, it's easy to read and parse by humans and machines, easy to store in document stores, and easy to program with; I'm still in awe of how beautifully JSON works together with Python dictionaries, for example, it's really a brilliant synergy between the two. But with JSON you sometimes run into trouble. For example, in the hospital that we work with, they have more than a thousand different JSON types that they stream between various applications and the knowledge graph, usually via Kafka but also other systems. So you get all these types, and most of these JSON objects need to be persisted, but you don't want to put them into all these separate silos; you need to put them in a data lake that you can do something with later. The problem right now is that if you look at any random JSON object, you literally have to go to the code to figure out how to deal with that particular data type. That might work with five different JSON objects, but if you have more than a thousand, you're in trouble. And then the other problem with JSON: in the hospital you have a JSON object for the patient, and then you have encounters, which are objects, which have diagnostics, which are objects, maybe a provider, maybe a link to the taxonomy. So how do you make the choice, as a developer, whether a patient is one massive nested structure or whether it is composed of smaller ones? JSON doesn't really have a standard for pointing from one object to another. So what you really want is for JSON to have semantics, and what I mean by that is: every object needs an identity that other objects can point to.
The only very simple thing you have to do is take a JSON object and give it an attribute @id, which is a JSON-LD attribute, still 100% valid JSON, and you give it an IRI. Because you can have a default namespace, it can still be a very simple name, but it's important that you give every JSON object a unique identifier, more readable than the Mongo identifier, most of the time a readable identifier. And you need a type, and the type is also an attribute, @type. The important thing is that each type then has, somewhere else, an RDF description that says: what are you trying to do with this type, what are the attributes, what are the data types, what are the links to other types of objects, and maybe even more metadata: what version are we on right now, how does it link to the processes, how does it link to the apps, what apps are using it. All the relevant metadata is in that ontology that describes the particular type. And then one more thing you can add to this JSON-LD object is what they call a context, @context, which is actually a pointer to the ontology that describes this particular type; but that context can be the same for almost every object in your hospital. Okay, and then you want a validation language. So JSON-LD is kind of a schema: it adds semantics, it's designed to link objects together, and it enables you to do joins and graph search in document stores. If you want to play with it, you can go to json-ld.org, and there's a playground where you can choose examples like a person, an event, a place, a product, a recipe, a library, an activity. What you see here (can you even read it, sorry) is the context, which in this case points to schema.org. Schema.org is something that Google and others have funded, and it's a massive ontology of e-commerce products and other things. Then you say that the type is a Person, the identity of this object is person one, and the name is Jane Doe, and when you serialize this JSON object you actually get the triples here. But in most JSON-LD stores, the way we do it, we also store the blob itself; otherwise you're way slower getting the objects back if you first have to pull everything together again. So we store it as a blob and as individual triples. And JSON-LD is now everywhere: Google now recommends it as the most important way you can mark up products on a web page. So if you take this Mad Hippie oil and look at the page source, somewhere in there is JSON-LD, and you see this whole object stuck in the page. And just to show the difference between meaning and no meaning: if I take out all the JSON-LD things, I get a very regular, nice-to-read JSON object, except the computer doesn't know how to interpret all these attributes. But if you add the fact that this is from schema.org, and this is of type Product, and there's also an aggregate rating linked to it, now Google can read this, they know what it means, and they can do interesting things with it. I already talked about it, but if you want to look at schema.org, there's a whole website just to look at the ontology of every product or thing you can have on the web. And then I have a little demo of what you can do with it. Our database, by the way, can store JSON-LD, so you can use it as a document store, but you can also look at it as a graph and do very complex graph analytics or complicated joins on your data, things you wouldn't easily do in MongoDB. Okay, so let's see if I can get this to work. If you want to work with it, here's a Jupyter notebook that we make available; you can find it on our website.
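As a concrete picture of what was just described, here is a minimal JSON-LD object written as a plain Python dictionary, in the spirit of the json-ld.org playground's Person example; the @id value is hypothetical.

```python
# A minimal JSON-LD document: still ordinary JSON, plus @context, @id and @type.
import json

person = {
    "@context": "https://schema.org/",             # ontology describing the type
    "@id": "https://example.org/people/person1",   # hypothetical unique identifier
    "@type": "Person",
    "name": "Jane Doe",
    "jobTitle": "Professor",
}

print(json.dumps(person, indent=2))
```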
If you just want to play with JSON-LD and add objects to the database, you can make a new database, with zero objects in it, and here are some objects from the JSON-LD playground that you can play with; here you see them represented as JSON in Python. But let me go to the main thing; I want to do a demo, and how many more minutes do I have? So I want to show you a little bit about how you can read JSON-LD, represent it as a graph, and still have JSON-LD. What I've taken here is a database that's openly available; it was published by Crunchbase in 2014, I believe, a while back, and I use it for demo purposes. Basically it's about all the startups that got a funding event, and it's also about all the VCs and big buyers in the United States, and you can get the data as JSON-LD or pure RDF. So, for example, we want to look at MongoDB; here's MongoDB, and I can look at that. MongoDB had various rounds of investments, so let's see who invested in it (oh, here, sorry, now I have to do it) and let's look at it over time. What happened is that at some point in time Union Square put one and a half million dollars into MongoDB, and then after a while Union Square and Flybridge put a little bit more money in, and then Sequoia Capital got involved and did a follow-up round, and then Sequoia Capital and In-Q-Tel got interested, and before you knew it you get, who was it, Salesforce putting in $150 million. So this is the first part, and I can actually look at MongoDB here, and you see the JSON-LD object, where you also see everything that was published: for example, what was the category for this thing, what city was it in. By the way, I talked about RDF; I can push a button here and you see all the underlying URIs for these predicates and for the values, but I usually like to look at it this way. Or I could look at a particular venture fund and see all the investors, how much money, when it happened, and so on. So this is MongoDB, and of course at some point we also had Couchbase, which was a competitor, and we can look at who put money in them. The thing I always find interesting (well, let me first look at these, okay, that's fine) so here are the people that put money into Couchbase, and I always find it interesting to see how two groups of investors are both trying to push a document store. First you got an investment there, then more investment, and then North Bridge Ventures and the others said, hey, we need to invest in a document store too, and they got more money, and then more money again, so you can see how investors were fighting against each other to be the first to have the biggest document store. And what I found interesting is, you look at this and you see that if you're an investor in Couchbase, you would never put money into MongoDB; I mean, why would you do that, you would bet against yourself. But still, these people work with each other. So I can still ask: was there any relationship between them, did they ever invest in something together? And then you see: oh yeah, they have other things they work on together. North Bridge Ventures will work with In-Q-Tel, for example; here they probably had (let's see who they invested in) Q-Day Fission, okay, and what do they do? There's a category: nanotechnology, organic semiconductors, and video, but not database stuff.
Right, so what you now see is: we started out with a bunch of JSON-LD objects that I put in the database, and I have this beautiful graph, and I can do a query where I ask, what are the things these people invest in? I have venture capitalists here that invest in Couchbase, venture capitalists there that invest in MongoDB, sometimes they do it together, but what are the things that they then invest in together, what kinds of categories? Say you want to do that as a query. Well, then you could just ask; let's take these guys here, In-Q-Tel, and this and this and this, so this is a trace that I have in my graph, the one I saw here. Oh no, I forgot one thing: I also wanted to know what the category was; okay, now I can copy it here. Now all I want to do is a query where I figure out what these investors are investing in. I do this: I turn these things into variable nodes (I don't have to write any code, I just look at the graph that I want to use as a query) and here, let's say, convert to a variable node, a new variable node, let's make it the category, although it could be anything. And then if I run the query, which you see is this, it writes the SPARQL query and you get these results, which you probably want to aggregate a little bit; I didn't want to write that while I was doing this, so let's see: here is the same query but with a little bit of aggregation, select category and count of categories, and you see they invest together in enterprise software, mobile, analytics, finance, advertising, and so on.
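For readers who want to see the shape of that aggregation, here is a rough, hypothetical sketch of such a co-investment query in SPARQL, using invented cb: property names rather than whatever the Crunchbase RDF export actually uses.

```python
# Sketch of a co-investment aggregation: categories of companies that two given
# investors have both funded, with a count per category. All cb: names are invented.
CO_INVESTMENT_QUERY = """
PREFIX cb: <http://example.org/crunchbase/>
SELECT ?category (COUNT(?round) AS ?deals)
WHERE {
    ?round   cb:investor  cb:In_Q_Tel ;
             cb:investor  cb:NorthBridgeVentures ;
             cb:investee  ?company .
    ?company cb:category  ?category .
}
GROUP BY ?category
ORDER BY DESC(?deals)
"""
# The string could be run with g.query(CO_INVESTMENT_QUERY) in rdflib, or sent to
# any SPARQL endpoint, as in the earlier snippets.
```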
So this is a complex query that you wouldn't do in MongoDB, or you would write a whole Python program, or whatever program, to do the complex joins, and that's why I wanted to show you this demo. But maybe you believe me that you could also do this in Python, so let's have a discussion about what you think of this, why you think it's not useful for the things you do, or maybe the stuff you think you could use it for. Well, it depends on how deep you go. We have a recent test where we load about 17 billion triples from all of the Wikipedia data, and there's a researcher that makes a query set available with 850 queries, and 90% of the queries can be done within 200 milliseconds. But I can also write a simple-looking query that takes a very long time to come back, so it depends totally on the nature of the query you have, whether the database is warm, and whether you have enough memory; and graph databases are a little more memory intensive than other databases, because there's so much random hopping around when you do graph queries. Alright, other questions? You're going to train it on those Slack articles, you want to train GPT-3 on those Slack articles? Okay, well, it seems right now everyone is getting into that; we're doing a lot of work with it too. In your case I really would build a taxonomy of, for example, all the skills that people have; there are skills taxonomies that you can probably find in RDF. Once you have a taxonomy, you can use the taxonomy for entity extraction; if you don't use a taxonomy, you basically get a bag of words without much structure and no tools to help you, whereas with a taxonomy you have a structured set of concepts with their synonyms and what have you to analyze LinkedIn. By the way, I know the people at LinkedIn pretty well; they have taxonomies of all the skills and the type of work people do and the type of company, and they have their own new graph database that they're thinking of making open source, called LIquid; it's also an RDF-based triple store, so they're deep into RDF too.

We're also doing a very fun project in natural language understanding where we use GPT-3 to create triples directly out of text that kind of reflect the underlying meaning. Now, GPT-3 is both smart and super dumb. We're using it to create rules to analyze text, to extract parts that we then further process. The problem with GPT-3 is that it knows statistically how people put sentences together, and then there's another layer on top to make sure it's grammatical and syntactical, but the pragmatics of the underlying real-world model is missing. We're trying to use GPT-3 to give us clues so we can then ourselves build up that underlying real-world model, a domain model that for us consists of triples. I didn't prepare it, but otherwise I could show you how far we've gotten with that for natural language understanding, so maybe you should contact us. I'm on the waitlist too, right. Alright, any other questions? Yeah, our company's expertise? That's a very good question. To begin with, I built the first version of our triple store in 2008; we already had a Prolog compiler, and you can map Prolog directly on top of triples (the underlying Prolog database can just be a triple store), so that was amazing. And the whole point of Lisp is that a program is data itself that you can manipulate, which again is extremely Lispy in this RDF world: layers of layers of layers of abstraction. So I could talk about that part for a while, but a great question. Alright; oh sorry, was your question about SHACL? Actually, I didn't know how much time I had and I thought I would never get there, but in the demo I have we literally have these objects; I can show you here. We do JSON-LD, so here's the SHACL context that you add to the object, with all the important parts of the SHACL declarations, but let's do it here: here is a SHACL expression in JSON-LD which basically validates a funding round. So we have JSON-LD (and to your question, note the Lispiness of this) we have JSON-LD to validate JSON-LD. We say the ID of the SHACL shape, the type (it's a node shape), and the class that we're trying to validate with this shape is a funding round, the one I just showed you. And then it says that a funding event needs to have exactly one investee, but you can have multiple investors, and then, just as a demo, I said the minimum raised amount needs to be more than $10,000, which is stupid, but it's an example. And in my demo, which I'm not going to do right now, I make a little database of funding rounds and then I delete an investee, or I delete an investor, or I change the raised amount to under 10,000, or I turn a number into an integer, and I say, okay, validate, and it will find all the objects that violate those constraints. So it's a built-in validation system. Yeah, because as a company, when people want a new feature from you, you first try to say no; at some point you can't deny it anymore, and then you try to find someone to finance it. We finally had customers that really wanted SHACL, so that's why we did it; we have to see the commercial value. But now everyone has SHACL on their checklist, SHACL, SHACL, SHACL, and then no one asks about SHACL anymore.
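For a sense of what such a funding-round shape looks like outside the demo, here is a hedged sketch in standard SHACL, validated with the pyshacl Python library; the cb: property names are invented stand-ins for the demo's actual vocabulary, and using pyshacl is my suggestion, not something from the talk (the speaker's database has its own built-in validator).

```python
# Sketch of a SHACL node shape for a funding round: exactly one investee, at least
# one investor, and a raised amount above 10,000. The cb: names are invented.
from rdflib import Graph
from pyshacl import validate

shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix cb:  <http://example.org/crunchbase/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

cb:FundingRoundShape a sh:NodeShape ;
    sh:targetClass cb:FundingRound ;
    sh:property [ sh:path cb:investee ; sh:minCount 1 ; sh:maxCount 1 ] ;
    sh:property [ sh:path cb:investor ; sh:minCount 1 ] ;
    sh:property [ sh:path cb:raisedAmount ;
                  sh:datatype xsd:decimal ;
                  sh:minExclusive 10000 ] .
"""

data = Graph()
data.parse("funding_rounds.ttl", format="turtle")   # hypothetical data file
shapes = Graph()
shapes.parse(data=shapes_ttl, format="turtle")

conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)
print(report_text)   # lists every funding round that violates the shape
```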
there are just so many standards. We're right now in the middle of the thing called RDF-star, where you can make metadata statements about other triples. So that is... yeah. Okay, well, thanks very much for your attention. My code is all posted on my GitHub and stuff.

[break between sessions]

Okay, testing, testing. That's me whispering; this is the normal voice. Should we turn the volume down? I feel like that's still pretty loud, I just won't shout. The microphone, is that okay for you guys? Is it too loud? Okay. So this is Getting to Know Your Data, an SPL prerequisite. SPL is the acronym for Search Processing Language. People think it's Splunk-something-language; it doesn't matter. I'm Mary Cordova and I'm going to help you figure out what's going on. There we go, I think that'll be better. So I've been working in security operations for nearly a decade. I started in SIEM engineering with a tool called ArcSight. I've been working with Splunk for about the last six, seven years doing SIEM and SOC work, IR, analytics for InfoSec, SOAR, and I've got a bunch of certifications, and you can see various talks I've done. We're going to talk a little bit about why you need to get to know your data before you start building dashboards and visualizations, then the process and the method that I use. Then we can do a live demo and you can see that in action, hopefully, if the phone works. And then a few resources and some wrap-up and questions at the end. By the way, everything I'm going to use for the live demo is publicly available, so you could literally get out your laptops and step through the same steps with me when we get there. The talk itself is pretty fast, so I think we'll have a good maybe 30 minutes to mess around at the end if you want.

I'm a member of the Splunk Trust now, but how I got into the Splunk Trust was an argument I was having with some then-current Splunk Trust members. I had posted a question and then my own answer to that question, and the first statement was "if the data set permits." They actually deleted my answer off the platform because, they said, there are so many unspoken caveats to "if the data set permits." But in my opinion, you have to understand your data before you can do anything, so of course every answer to every question is "if the data set permits." Once you get past syntax, right — pipe, table, command — from there it's all about whether the command you want to apply to your data gives you the expected output, and that's going to be contingent on your data. And I think this process and method could be platform agnostic.
So if you were using something like ELK or another tool, maybe BigQuery and Looker, the process might apply; obviously the commands won't. In Splunk, a lot of times there's an app or a TA, right? Say you're going to monitor Linux logs: there's a *nix app that you deploy onto your search head, and it does all your field extractions for you. What happens all the time, though, and is the bane of InfoSec and some engineers, is that vendors will change their schema and then everything you built no longer works. Also, I pretty much don't ever trust anybody else, so I might go install an app or a TA, but I will go through basically the exact same process to verify that their work is the work I would have done. There was a TA a few years ago — I don't know if they fixed it — a Cisco one, built and deployed by Cisco and available on Splunkbase, that was parsing network traffic so that it looked like it was in the reverse order. All of your inbound traffic looked like outbound and vice versa, just because they had the source IP and destination IP flipped. So even if there is an app, even if somebody did this work for you, you might still want to do it again, or you might have to do it later anyway when somebody changes something.

So here's the process that I use. It has three main components. Data preparation: this is going to be like 90% of the work. Publishing your base search: once you've gotten all of your data analyzed, sorted out, figured out, you've identified the most important 10, 15, 20 fields in your data, and they're all normalized and clean, you just save that somewhere so that people can use it easily. And then you just reuse that base search in all of your analytics. So maybe the first 10 lines of the search are the same across an entire dashboard, and it's the last three lines, where you want to slice and dice things differently, that change per analytic product. That point is important too: if you have these base searches published, then you aren't the only one who has to do all of the data analysis work. People don't have to come to you and say, I need a dashboard with some pie charts and bar charts on it. You can be like, here's your base search, go build your own dashboard, I'm super busy with this other stuff. So that's one of the other benefits: if you get the data preparation and the base search done, you can push the analytics out to the people who own them. Because no matter what, as soon as you publish a dashboard, people are like, that's great, but now could I see it by... and then they want to split your data by another dimension. If they can do that part on their own, the analytics is more democratized.

This is the method, and we'll go into each of these more deeply. Under the data preparation stage, first you want to remove all your noise. There are two main pieces of that. There's Splunk noise, just some metadata stuff that Splunk injects that you don't need for analytics, so you want to remove that. And there's event noise. There are always extra fields you don't really need: they're not helpful, or they're duplicates, or, if you've got an app installed, you might have the original fields and then the parsed CIM fields.
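A rough sketch of that base-search-plus-analytics split in SPL (the index, sourcetype, and field names here are placeholders for illustration, not the ones from this dataset):

index=endpoint sourcetype="XmlWinEventLog:Microsoft-Windows-Sysmon/Operational"
| fields - punct date_* timestartpos timeendpos   ```drop Splunk metadata noise```
| rename Computer AS dest                         ```normalize field names```
| eval dest=upper(dest)                           ```normalize field values```
| table _time dest signature_id signature user process

Everything above would be the saved base search; each dashboard panel then just appends its own last line or two, for example:

| stats count BY signature dest
| sort - count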
So you just want to get rid of the noise, and I've found that's really helpful for me, because if you're looking at a pile of dirty laundry, you can pull out the things you don't want and throw them in the corner, and eventually you have only the stuff you do want — clean laundry, right? That's kind of what we're doing here. Then, once you've got the fields you do want, you start to normalize them: field names, and then also field values. And these are really the most basic commands you'll use as soon as you start using Splunk, and that's like 90% of everything you ever need to do anything, really. Deploying: there are a couple of options. Once you've got your search and all of your data normalized, you can publish that into a configuration file, props.conf, or you can just save it as a base search, either way. And then your base search, again, is your 10 to 20 most important fields, and then your analytics on top.

When I'm getting set up, I always run in verbose mode — you can kind of see that down in the right-hand corner there. So run your searches in verbose mode and set a static time range. You don't want a rolling window, because then the data is shifting underneath you while you're trying to perform your normalization and analysis. Later on you'll spot check that, but for now you want repeatable, reliable results, so that when you make a change in your search, you can see the output and know it wasn't the data that changed, it was your work. Splunk then has the CIM data models. Depending on what I'm working with — in this case I was using Sysmon, and we'll see when we get to the demo, there's some ransomware data that Splunk publishes that you can play with — so with the Sysmon ransomware data, it's the endpoint event signatures and intrusion detection data models that I want to map my data to. And then, of course, always the admin guide too, because there will be things like event type 1. Okay, well, what the hell does that mean? You need the admin guide to translate that for you. You can then build those translations into Splunk, but without the admin guide you won't know.

All right, so, removing the Splunk noise. When I get started with a brand new data set, I'm going to assume you already know where your data lives, and by that I mean your index and your source type. Somebody had to ingest the data and put it somewhere; maybe that was you, so you know where it lives. If you don't, I maybe should have included a slide on how to find that. But let's assume you know the index and the source type for your data. This is the first line that I throw into any search: I just get rid of my Splunk noise. And then, first, we want to focus on fields that are present in 100% of the events that we're looking at. The reason is that if you search, say, action=*, that will return all of the events that have action as a field. But if 50% of your events don't have the action field in them at all, those events get completely dropped, and maybe that wasn't intentional. Maybe you were thinking, why would I even want the ones where action is empty? That will get the empty ones, but it won't get the ones where action doesn't even exist. And sometimes people have a hard time figuring that out at first.
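As a sketch, that first noise-removal line and the action=* pitfall look something like this in SPL; the index names are made up, and using fillnull here is my own suggestion rather than something from the talk:

index=endpoint sourcetype=my_sourcetype
| fields - punct date_* timestartpos timeendpos   ```strip the Splunk metadata fields```

index=web_proxy action=*
```silently drops any event that has no action field at all```

index=web_proxy
| fillnull value="unknown" action
```keeps every event; a missing action simply becomes "unknown"```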
So I'll show you in the UI where you can select the fields that are in 100% of your data, because those are guaranteed usable in your analytics without arbitrarily losing data you didn't mean to. And then I'm just going to table all of those fields out, because I need to see them; I need to see whether they're useful or not. And then you see I've got this new fields minus. Oh, and I will say: the dirty laundry pile that we're throwing into the corner of the room, that's fields minus. So fields minus — that first line we saw on the previous slide was getting rid of the Splunk noise, and then we have another fields minus here, and I figured out that this was my event noise. Well, how did I figure that out? I had to go look at my data. So you almost have to know your data before you can get to know your data, before you can figure out what to get rid of. But that's why I table it all out, and then I can start to see: oh, I've got a bunch of duplicates, I don't need 15 versions of the same field, so I can throw those out. The other thing is that it's the same command here, and you're like, okay, why did you use two different lines for that? It could all just be one line. I like to keep my piles segmented, right? Maybe I have a black pile of dirty laundry and then a red pile and a blue pile, because later I'm like, oh, you know what, let me go look at the red pile again — I can find it more easily this way. I will say there is literally no dirty laundry at my house that isn't black. Or gray, I guess.

And then, once you've identified them — in the previous slide I had the second fields minus, right, fields minus event channel and so on — am I sure they're trash? I will usually go back and double check all my work. So I flip that second fields minus into a table, which then only displays those fields, and I run a quick stats values(*) as *, and that gives me a quick visualization of all of the distinct values in each of those fields. So I can say: yeah, for sure, event ID and task are duplicates. Keyword: completely useless, because I don't know what it means, it's the exact same value in every single event, so it's really not giving me any contextual information, and I know it's Sysmon already because that's my source type. High-cardinality fields, though — fields that have thousands of distinct values — will crash your browser, so use stats values(*) as * with caution.

And then you want to start to identify the important fields. So you can see now I've got three piles of dirty laundry. I've still got my table command, which holds the fields that are still important from the ones that are 100% prevalent. And I just wanted to start working with some of my most important fields. So I saw, you know, I've got four hostname fields; let me look at all of those and figure out what the difference is. Which ones do I want to keep? Which ones do I want to trash? Do they need to be normalized in some way? And I'm just going to use, again, our basic fields, table, stats, and the UI. We're going to find and group fields that have duplicate and similar values, make more trash piles, and figure out which ones we want to keep. People will look at this when I'm working with them one-on-one and we're doing a screen share, and they want to tell me: Mary, Mary — fields, fields, fields, tables, tables, tables. Don't worry about it, right? We're not actually building a search; we're just doing data analysis right now.
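A rough SPL sketch of that spot-check, with made-up field names standing in for the suspected trash pile:

index=endpoint sourcetype=my_sourcetype
| fields - punct date_* timestartpos timeendpos
| table EventID Task Keywords EventChannel   ```the pile I suspect is trash```
| stats values(*) as *                       ```one row: every distinct value of every remaining field```

As she warns, run that last line with caution against high-cardinality fields; a field with thousands of distinct values can make the results page unusable.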
So none of that really matters. Don't worry, it's okay, we'll clean it up at the end. Is anybody relatively new to regular expressions? Are you kind of okay with regular expressions? If you're not, there's a really excellent book. It's like a quarter-inch thick, like 15 bucks on Amazon. I think I only read the first half of it, so you can get through like 30 pages and have most of the regex you'll need for anything except, perhaps, regex competitions.

So then you start to normalize. You can see right here at the bottom: I decided to rename device as dest, and then I also did a regex to extract the destination NT domain, which is just the domain and not the full FQDN. And then I made a new trash pile, because I no longer need computer, device, or host; those are all duplicates. And then again I'm going to spot check. Where I just invented a new field called dest_nt_host — well, I know for a fact that my device field was present in 100% of my data, because that's what I was starting with, fields present in 100% of events. So if I perform a regex and then I do a "where isnull" on the new field I created and it comes back with events, then my regex is screwed up somehow. Because if I started from 100%, I should still be preserving 100%.

It's also a little bit of art and science. I know that a CIM schema should be... the CIM schema is the CIM schema, right? But why did I use dest instead of device? It turns out, after I worked with this data set for about four days building this talk originally, I ended up not doing this at the end; I went back to the original device and destination. But on day one I was like, device? I hate device. It's this weird intermediary thing where it isn't exactly clear whether it's the source or the destination. By the time I got to the end, I understood, with the way the traffic and the data were working in this data set, that I actually needed the intermediary and the two directions. So it's not exactly cut and dried, and everybody's going to do things a little bit differently. Fundamentally, you should come up with the same result at the very end: you should end up with roughly the same fields, normalized roughly the same way.

And then you just iterate. You walk through all of this again, and then you start to look at fields with 90% coverage, and then 50% coverage. Because a lot of times those 100%-coverage fields are going to be the host name, the IP, whatever, but they might not carry the real contextual information about what is occurring in that event data. So you almost always have to go down into the 90%, probably the 50%. Anything less than 50% and you might actually be building a very specific use case, as opposed to a more generic base search that's going to cover your 80%.

So here we've got fields, fields, fields, tables, tables. At the very end, you clean that up. I actually don't need the fields minus anymore, because I already know what's trash. I've already figured out my table command at the bottom: these are all the things I've normalized and want to keep, and I've just got my evals and regexes and renames, et cetera, up top. And then, the way I like to do it: when I table out my very last table command, I start with my most important fields on the left. For me, that's always going to be the timestamp, the system generating the event, the type of event, and then the outcome.
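Roughly, that rename, extraction, and spot-check look like the following in SPL; the exact regex and field names are my reconstruction, not a copy of the slide:

| rename device AS dest
| rex field=dest "^(?<dest_nt_host>[^\.]+)\.(?<dest_nt_domain>.+)$"   ```short host name, then everything after the first dot```
| where isnull(dest_nt_domain)   ```validation only: if dest was in 100% of events, this should return nothing```

The where isnull() line is the temporary check; it comes back out of the final base search.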
So in this case we see the time, the host, the signature ID — this is Sysmon, so that would be something like event ID 5 — and then the signature, which is the human-readable version of whatever 5 is, then direction, user. These ones at the very end, like command line and file path, I think I was getting all the way down into like 1% event coverage, but they were still important enough, and I had enough room on my screen, to go ahead and add them. I really don't like scrolling left and right. As a bonus: if you're building a dashboard, I really don't like scrolling up and down either. I try to get my dashboards to fit on one screen, because people are never going to look at the stuff on the bottom — maybe half a scroll down.

And then again, I'll validate. This whole time we've been working with a static time range, and that's going to depend on your dataset too. You might run a search over seven days and it comes back like that; you might run a search over 15 seconds and it seems to take forever. Things like firewall logs, or — I don't know if any of you are familiar with Proofpoint, the mail gateway — that stuff has so many logs you can hardly run a 30-second search. But your AV? Your AV you should be able to run over seven days, hopefully, and not have a ton of data. So when you spot check, when you've got your base search and you think it's done, expand your time range. Or, if you've used event sampling to make your searches run faster while you're working, remove the sampling so you're getting the full dataset. And then you want to check again: are all your extractions still working? Are the fields that are supposed to be null actually null, and the ones that aren't, not null, et cetera?

And then you can deploy and publish your base search. So I should have had a snip in here, because all the stuff in the middle is missing — although, if you put it into your props.conf, you actually wouldn't have all the stuff in the middle, because the system would do that work for you. From the rename down to the table, that you could put into a props.conf file and the system would do it for you. I like to actually save them like this, because your more advanced users might need to see it all. Maybe they don't trust you, right? I don't trust anybody else; maybe they don't trust me either. That way they can see my work. It's also helpful because, if they want to start learning to do this work themselves, they've got some cheat-sheet code — it's like a private Stack Overflow for them. I gave this talk to another group in San Diego, and one of the women there does this, but what she does is put these into macros. And I was like, oh my God, that's so smart. Because then people just know: oh, Sysmon base search macro, or Palo Alto base search macro. I thought that was really smart, and I stole it from her. What I would have done is just bookmark a hyperlink and say, here guys, bookmark these, save these.

But then that's where you can actually start to do some analytics, right? Now you've got your base search. So with this one here, we can craft a search that will show us which system was getting ransomware at the time, put in some threshold, build an alert, and if something flips over that threshold — which this system would have — it fires. And that's sort of how you start to do your detection engineering.
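If you like the macro idea, a hypothetical macros.conf stanza might look like this; the macro name and search body are just an example of the pattern, not her actual configuration:

# macros.conf
[sysmon_base_search]
definition = index=endpoint sourcetype="XmlWinEventLog:Microsoft-Windows-Sysmon/Operational" \
| fields - punct date_* timestartpos timeendpos \
| rename Computer AS dest \
| eval dest=upper(dest) \
| table _time dest signature_id signature user process

Anyone can then start an analytic with `sysmon_base_search` | stats count BY signature and never touch the plumbing.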
Okay, so 4:53 — does that mean I've been talking 23 minutes? Let's do this first. So here are my resources. Again, if you go to Splunk's .conf conference site and watch online, this talk is from last year; it's recorded and they have the PDF of it too, so you can get all of these hyperlinks as clickable links there. Otherwise you can take a picture, and you can Google this stuff, although I made the links friendly so you don't see the real URLs. They're clickable if you get the PDF off of Splunk's online recording, and the whole talk is recorded there as well, so you can replay it if you want. Let's do some questions, and then, for anybody who wants to stay — obviously you could have left whenever you wanted to — we can do some of the demo if you want.

Yes, so we can show that in the demo for sure, in the UI that way. So, do you just want to look at a little bit of a demo? Okay. I think it'll just log me in... oh no, it won't, okay. This security dataset project is public, so you just come in here; you can see I've used it a lot. It's actually pretty cool. I don't know if any of you are familiar with it, but Splunk does a Boss of the SOC competition — it's like a blue-team capture the flag — and this dataset is from their BOTS version two. You know what else is really cool? Did I have it on the slide deck? Oh, why did I close my slide deck? That was dumb. My hands. Yes, so this is free as well: you just log in, sign up, and you can actually go do the capture the flag on the older versions of the dataset. It's pretty cool. Will it make me log in? All my business up on screen share... okay, here we go. Oh, here we go. Yeah, so you get the challenge questions and you can go through them, so it's really pretty cool, and it's freely available.

All right. So this is the actual dataset project, and if you pick Boss of the SOC one, it will actually step you through some of the exercises as well. So we're going to do that. Oops, what happened here? Let's just go to my bookmark. Okay, I guess it's not horrific — I'm a little off to the side here, and I hate to mess with it. Oh, it's super big. We actually do need this side, though, more than the other side. Okay, there, that's a little better.

So here I've got my index and my source type, right? And you should pretty much never, ever use All Time, but I'm going to use it right now only because, A, this is a demo site, and if I'm going to be iterating this command over and over and over again, I don't want to sit here and wait for it. But I need good representative data, a good sample of my data to work with, because if I'm only working with 5% of my data and it's all similar, then all the work I do isn't going to match the rest. But this looks promising. Oh wait, I was going to pick that big spike — we'll pick this one. So we can zoom to the selection there, and that will help us get something that runs more quickly, and again I'm running in verbose mode. What that does is give me all of these native field extractions here. If we run it — and I'll just show you, since we have time — if we run it in fast mode, which is what they'll tell you to do in Splunk trainings, right?
Because they already know the data; what they're trying to teach you is how to use the commands. But you don't get all those field extractions out here. Well, if you already know what the fields are, then you don't necessarily need them, but that was one of the things that always left me feeling confused when I first started taking the Splunk trainings. I'm like, wow, how do you know it's access_combined, status 502, or whatever?

All right, so, verbose mode. This is still running a long time, so I'm just going to zoom in again, and you can see we've got this static time range here. We're not using a 30-second window or the last 15 minutes, which, right, if we ran it a minute later, would be a different data set. We'll zoom into this one, and then hopefully this should run pretty quickly. My phone is holding up pretty well. Okay, this looks pretty good. What I'm going to do — I almost never, ever use this, but just to get this to run a little more quickly — I'm going to use event sampling, which will give us one out of every hundred events. So that came back pretty fast. Okay: index and source type, verbose mode, static time range. There's a whole ton of fields over here, so which are the important ones? First thing — you don't actually have to watch me type, because I saved all of these — here's our Splunk noise. I could type this almost from memory really fast, but that gets a whole bunch of the noise out of here that we don't need. Perfect.

Now what? Now I want to focus on 100% field prevalence. So I'm going to go down here to All Fields, and you've got this filter here. If we set it to all, you can see there's 0.12%, 1%, 67%; we'll want to work with these, probably those 30-percenters down here, but let's just flip it to 100%. And this is kind of tedious, but honestly it's the best way I've figured out how to do it: I just copy and paste this table. And then you do have to go through here and... no, I won't make you watch, you get the point. There we go.

So now, why do I do that? Because when we table out these fields, we can start to look and see what's interesting and not interesting, as opposed to this, right, where we've got the raw event and we'd have to click through every one of these; it's just not as workable. So we table these guys out, and we can start to see some things right off the bat. We've got what, one, two, three hosts — four host name fields in two different formats; event ID and event code, which are duplicates of each other; some more stuff over here. Oh look, what are these? Event code, event ID, signature ID, and task. So we've got four fields that are the same, and we can start to get rid of some of that stuff. But what we need — I won't make you watch me Google it, so let's pretend I have the CIM schema up for the endpoint stuff — I happen to know off the top of my head that signature_id and signature are the fields we're going to want to keep, and that task, event ID, and event code are going to be trash. So I just want to start some new trash piles. What did we say? Task, event code, event ID. Oh, and then what did we say? We said signature and signature ID we want to keep, right? So even though this is going to look like a trash pile, it's not really, but I want to keep it separate. I want to keep it separate so I know that I want to keep it, but I don't want to look at it anymore, because I've already seen it and I've assessed that it's good to keep.
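At roughly this point the demo search looks something like the sketch below; the index name and the field list are reconstructed from what she describes on screen, not copied from it:

index=botsv2 sourcetype="XmlWinEventLog:Microsoft-Windows-Sysmon/Operational"
| fields - punct date_* timestartpos timeendpos   ```the Splunk noise```
| table _time host Computer EventID EventCode signature_id signature Task Keywords EventChannel   ```fields at 100% prevalence```
| fields - Task EventCode EventID                 ```duplicate trash pile```
| fields - signature signature_id                 ```keepers, parked out of sight for now```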
So I'm going to do a fields minus again, but I know that this one is eventually going to turn into my final approved table list. And then we can get rid of this here. What else? We've gotten rid of task, event code, event ID. All right, let's double check, though. Let's see what we've got left now. Or let's maybe deal with these host names, I guess. Fields plus... what was it? Computer-something, host, device-something. That should give us four. Yeah, there we go. All right, cool. So you see we've got the FQDN and then the short version. I personally like the short version because it doesn't take up as much real estate, but it's also going to depend on your environment. If you work in an environment with multiple domains, you might actually need the additional contextual information that includes the domain of the asset. If you're in a single domain, I'd personally go with the short version, because you know what the domain is. But we're going to get rid of computer and get rid of host, and I'm going to put them up here, because they're not Splunk trash, which is right here; this is more like extra, duplicate event trash. Computer, host. Okay.

So then, do I need two fields for this, or do I just want the domain? Maybe I just keep the host name; maybe I want two fields, host name and domain; or maybe I want one field that's just the FQDN. I don't know — what do you vote for? You want two fields? Okay, so I'm going to do this, and I'm going to put it right here, because these are my two lines that are eventually going to be my official work. Eval device_nt_host equals upper of device_nt_host. Why? I don't know, because I was taught a million years ago that host names are uppercase. I think that's a Windows thing, right? I know we're at a Linux conference, so, very good. And then we need the domain, so we're going to do this. All right, now how does this go? It goes quote, parenthesis, question mark, angle brackets — crocodiles — quote... or whatever. So we're going to call that device_nt_domain, and we want, oh dear, everything after the first dot. So it's going to be... oh my gosh, regex live in front of an audience. Whoops, we don't want those. Alphanumerics and dashes, I think, can go in a host name. Anybody remember what else? I think that's a backslash dot. Well, I was thinking I'd start from the end and work up, but I think it's easier if I go from the front down, because otherwise I've got nots instead of assertions. You know, I've got it in my slide deck. I think this will work. Oh, we've got to put it in here. There we go. And then — I'm sort of neurotic about the order of my commands too; you obviously don't have to be this neurotic, but if I'm going to do one upper, I'm going to put the other upper next to it. So eval... actually I want to do it like this: eval device_nt_domain equals upper. Yay. All right, I'm happy. So now we can get rid of this, we can take this out, because we're done with the devices, and we can add them to our list here: device_nt_host, device_nt_domain. So you can see, at the very end this fields minus is going to get turned into a table command, but what we've got is the start of our search, and we've just, for our own brains, kept all of our trash piles and everything else separate. Cool. So now we can see... oh, you know what, I keep talking about Windows logs; I maybe should have picked a different dataset.
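What she's typing there is roughly the following SPL; the exact regex and the field the domain is pulled from are my best guess at the on-screen version:

| eval device_nt_host=upper(device_nt_host)                   ```host names uppercased, old Windows habit```
| rex field=Computer "^[\w\-]+\.(?<device_nt_domain>.+)$"     ```keep everything after the first dot as the domain```
| eval device_nt_domain=upper(device_nt_domain)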
Thank you — I would have to redo the whole talk, though. I don't know how many of you work with Windows event logs. This Security ID is usually the user account that is performing the activity, so it can be an important field. I think it's 18 or 19 or 20, I forget which one, that's the system admin account, right? Normally I would keep it, but if we take a look — will we crash the browser? let's try it — we can see that every single event, all 1600 events we're working with (and I normally work with a bigger dataset than this), has the exact same Security ID. It's the same user account. What that means is that this is actually the account that is running Sysmon, which is collecting the event data; it's not an account performing the actions that Sysmon is recording. So it's actually a useless field that I don't need cluttering up my UI. If this were a different dataset — still Windows, but not Sysmon, not this ransomware data — that would probably be a field we'd want to keep.

We also see right here that we've got two timestamp fields. Now, when you do a stats values(*) as *... oh, and I don't have it in here: _time is the field that Splunk uses to mark the timestamp for an event. I didn't have it here in the UI, but I can see right there that I don't need these extra time fields, so we can put them up here in the trash too. We can keep our stats values. Event channel: same as our source type, we can get rid of that. We like the event description — that's actually helpful... although, didn't we see that already? You know what, we did see that already, I think. Let's get rid of event channel first, and let's bring back our signature fields. Oh, here they are. So, what's interesting — okay, this is a good example — does signature have more values, or does it just look longer? Driver load, file create, image, network... oh no, it's the same; it just looked longer because the other one is a shorter field. Okay, great, so we can get rid of event description — that's a duplicate too — and we can put our signatures back up here. Dude, we got rid of Security ID, right? I definitely don't care about whatever version of Sysmon is running; I'm not the Sysmon administrator, I'm the security somebody. Record ID? I don't know. I don't think it gets me anything I need to worry about, but you know what, maybe it does. I don't know. So this pile is going to be my pile to go look at later — my just-in-case bin. Oh, keywords: useless. Again, it might be something we want to translate later, but even if this were a really interesting piece of information, if it's the same value in literally every single event — which it is, because we only have one value — how valuable can it actually be? You could literally put it as the title of your chart and then you don't need the field. We'll save it right here anyway. All right, keywords... where are we at now? Level, opcode. Action, good; direction, good; process, great. So if we take this down here now, we basically have almost the start of what is interesting about this data. And then we'd just do this again — well, we won't do it again here, but what we would do is come over here and start to look at all the fields with 90% coverage, right? We can sort this.
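A quick, hypothetical way to ask "does this field ever vary?" before declaring it trash — this is my own one-liner, not one she ran, and the SecurityID field name is assumed:

index=botsv2 sourcetype="XmlWinEventLog:Microsoft-Windows-Sysmon/Operational"
| stats dc(SecurityID) AS distinct_values values(SecurityID) AS sample_values
```a distinct count of 1 means the field never varies in this dataset and adds no analytic context```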
We know we already looked at the hundreds. Oh, that's all we've got here, okay. Let's go to 50. Here we go, right? So we would just take these guys here and work through that again. Eventually, what we get to is: here's our 90%, right? We have some signatures that we liked, and then all of these guys down here at the end are the new 90-percenters that we'd walk through the same process.

What we get eventually is — so here is the actual base search I ended up creating. You can see I did filter some things out. Like, I don't care about Sysmon monitoring Splunk; that's not useful to me as a security person — I mean, I guess maybe if I were monitoring whether Splunk got shut down. I figured out this Acronis thing — I think it's backup software. Tenable, I don't need that either. So I've got my index and source type, some data I just don't need filtered out, and I've got my renames here. So I renamed image_loaded to the CIM-normalized name — I don't even think it's actually a CIM field, but at the very least it's lower snake case. This would drive me completely crazy: even if image_loaded weren't a CIM field, if you left it in camel case down here in the table command with all of the lower-snake-case fields, I would be very sad. And then, right, we want to remove our sampling.

What I did find out when I was looking at this — do we have it? this one, no, here — process: if app is null, return process, else return app. It turned out that in the hundred-some-odd, two hundred thousand events in this data set, there were five that did not have a process field once I was done with all my normalization. And I went and checked every field, right? Every single field, I was like: where is null? And then I looked — oh, it's null, it's not supposed to be there, it's fine, okay, go look at the next one. Where is it empty, an empty string, is it null, is there a value there, does the value match, is it lowercase, is it uppercase, whatever. What I found were four events out of like 200,000 that did not have a process. When I looked at them, what it actually was, was Sysmon starting at boot time. So it did have the process, but it was in a different field — monitoring itself at boot or something. But that just goes to show the level of neurosis I'm at when I'm validating this kind of thing. I'm sure nobody else would have cried about those four events.

All right, so now, should we try to find the system that was being hit by the ransomware? So what do we have — what time is it? Five nineteen... ten minutes. Maybe we can come up with something for ransomware real quick. I think what we want is stats count, let's do count of process. Let's see: ransomware, what does it do? It encrypts every file on your system, and I think it triggers a unique process for each file when it starts to encrypt them. So I think if we count the process IDs by system, by device — oh, interesting, I decided not to uppercase this one. Weird, see? A little art and a little science. Stats count of process ID by device. And let's also do distinct count. Fingers crossed, I think we can see which one it's going to be. Can you see that all right? Oh, you know what, what's the time range on this? Oops. Yeah, 27,000. But if we were going to write a search, what would we want? What have we got here? We're actually looking at a whole 24 hours; we wouldn't want that for an actual search. 4 p.m. — we'd want something like a one-hour window. Zoom to selection, okay.
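The null-handling she mentions — "if app is null, return process, else return app" — is an eval along these lines; the destination field name is my assumption:

| eval process=if(isnull(app), process, app)   ```fall back to process when app is missing```
| where isnull(process) OR process=""          ```spot check: should come back (almost) empty```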
So if we were going to write a detection alert for a SIEM or a SOC, this would be our base search that we would save and give our users, and we could reuse it any number of times. All we had to do was throw this one line on the bottom and now we have a ransomware detection, or a something-fishy detection: something that's spawning a huge number of processes on your system. So in one hour — we could easily put... we don't know if this other one also has something fishy going on; 4,000 is kind of high. And we wouldn't want to just run this every hour and have it hand us this whole list every time. If you've got 20,000 systems in your environment, every hour you'd have a list of 20,000 systems. So we need some kind of threshold, and then we can set that as an alert so we don't create alert fatigue for our SOC. So we would do: where the process count is greater than 1,000... man, 5,000. We really need a larger data set to pick an appropriate threshold; right now we're just looking at three hosts in an environment. You would look in your own environment — hopefully you don't have active ransomware going on — and see what's normal. There might be servers, web servers, that are really busy, so they might get their own alert because for them 10,000 is normal; but for your user endpoints the threshold could be 5,000, or whatever. Where the count is greater than 5,000. And then we don't need... wait, we do, yeah, distinct count was not... that's interesting, but also, maybe counting just the process IDs is being lazy. Do we have the target file? Let's also do that. I'm going to do, actually, what was it, file path? And I also want to do this because I want to see whether... okay. So we've got only... okay. So file path is not a good indicator of anything either. So maybe the process ID count is the only thing. Maybe this isn't the best alert, but it definitely does work — I'll guarantee you that's the host that got ransomwared. You just have to poke around through it and figure out what the data has that's interesting, that you could use.

The thing about security detection engineering is that nothing ever says, hey, I'm bad. It's all normal activity that's just abnormal, and that's the hard part. No matter what search you write, it's always kind of a judgment call: what's the appropriate threshold? You don't want to blast your SOC with a thousand alerts a day of trash they have to look at, but it's inevitable that at least half of the things they do look at, they'll say, oh, it's normal. And you just can't suppress and whitelist everything. But that's pretty good — there you go, we caught a ransomware. All right, I think that's it. Comments, questions? Oh wait, hang on, I forgot one last slide. Lydia's favorite slide. Here we go. She smiled. You heard it — I turned around, she was smiling.
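To recap the detection she sketches, the analytic layered on top of the base search is roughly the following; the macro name, field names, and the 5,000 threshold are stand-ins for whatever fits your environment:

`sysmon_base_search`
| stats count(process_id) AS process_count dc(process_id) AS distinct_processes BY dest
| where process_count > 5000   ```tune the threshold to what is normal for your hosts```

Saved as an hourly alert, this only fires for hosts spawning an abnormal number of processes instead of listing every host every hour.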
Yeah, good. So when I know there's a lot of noise, it's going to be up at this level, but there were just too many things that were similar. I'll be around afterwards — I don't know if I'll be in the hotel, but probably at the bar and stuff.

[break between sessions; microphone setup]

So, hello, everybody. I'd like to introduce our speaker tonight. And if you're playing the exhibitor hall passport game, you'll find number three here. Data Con LA was called Big Data LA and changed its name in 2018; that wonderful conference is happening on the USC campus about two weeks from now. I've attended that conference and it's fantastic. I'd also like to mention that Shubhash is the founder of Data for Good, a public and private consortium for using data to solve social challenges.

I got this, thank you. Thank you, Ron. Thank you all for joining me tonight. So today I'm going to be talking about modern data engineering practices and how we implemented an architecture that works at the CSU. We'll start off with an agenda; I'm just going to go through this quickly. I'll give you a brief overview of who I am, then talk about the CSU, and then I'll go through the data lake architecture that I've worked on since landing at the CSU and some of the key features that we implemented. We've also been able to scale, grow rapidly, and build multiple use cases, and now we're one of the key groups that multiple other groups lean on for the skills and technologies to bring their own use cases to the forefront.

So, who am I? I'm the director for cloud data engineering at the Cal State University, Office of the Chancellor. I'm also the founder of Data Con LA, which is the largest data conference in the SoCal region — in fact, the conference is in just two weeks, August 13th, at USC. I'm also the founder of Data for Good, a nonprofit using data to solve social challenges. I was recently awarded the AWS Education Champion Award — I'm going to Seattle next week to receive the award and citation — and I'm a member of ACM and IEEE. So, the CSU: what are we? It's a public university system; a lot of you have probably heard of it. It's the largest four-year university system in the nation, with nearly half a million students and about 50,000 faculty.
So, I'm at the Chancellor's Office, which is not actually a campus; we're like the hub and spoke for all 23 campuses. We primarily gather the data from all 23 campuses, process it for them, and send it back to them on a daily basis. And every year we award nearly 100,000 bachelor's, master's, and doctoral degrees. So, with that, let's get started.

The first thing I want to talk about is data trends. One of the key things we've noticed over the past several decades is that data has been growing exponentially. You've seen this just in daily life: the amount of data your phone collects, the data from watching Netflix or browsing the web — there's a lot of data coming in. Using that data, people can make recommendations and do a number of things that supposedly make our lives easier, but it can also make them complicated: data privacy challenges, your data getting hacked, data being misused or sold to third-party companies to deliver ads to you. So there are a lot of pros and cons to all the data being generated. These are the trends: exponential growth, new sources, the diversity of the data, and the fact that it's used by multiple people as well as many applications.

So what does that mean? It means people need a way to process the data. Someone needs to actually go in and create an architecture and platform for producing meaningful results from that data. You can think of this as companies moving to a data lake architecture, which is about bringing the best of both worlds together. If you think about data warehousing, which was the paradigm 20 years ago — and even today I still see plenty of data warehouses in scope — we need to move beyond it. Data warehouses work very well for structured data but don't handle semi-structured and unstructured data as well, which means you need a mechanism to handle that data in a better fashion. Plus, when you talk to data scientists, they're always saying: we need clean data, we need our data to be scrubbed. You can obviously do some of that with your data warehouses, but not all of it. And you want to do that cleaning and scrubbing at the beginning of the pipeline rather than towards the end, because by the time the data lands in a data warehouse, it has already been transformed. If you want your data scientists to extract maximum value from data, they prefer to get access to the data sources, and part of that is making sure the data is clean for them to use.

So what does that mean? It extends and evolves your data architecture: store data in any form, and run any type of analytics on it, from data warehousing to predictive. I got this image from somewhere and I liked it; it says, welcome to the data lake. So I'm going to talk about what this means — how we combine structured, semi-structured, and unstructured data, and how you build something that can keep making sense as you grow.
So the concept of the data lake came about maybe five to six years ago. It extended the data warehouse and said: let's bring in the semi-structured and unstructured data too. Things like object storage — your AWS S3 — how do you combine that with your data warehouse to make a system or platform from which you can extend and evolve your data architecture? You bring in your BI, your reports, data science, machine learning, and you combine all of that into your data lake.

But over time, people started asking: this is great, we can now bring in semi-structured and unstructured data, but we still haven't talked about how we clean the data, master the data, and actually make use of the data. So people started saying we need to bring some kind of data governance into the frame. Data governance is a concept that's been around for at least two decades. It's basically about ensuring a consistent data philosophy, not just within the group you're in, but across multiple groups: master data management, metadata management, data lineage, data quality management, data catalogs, data dictionaries. These are the fundamental data governance capabilities that need to be brought to the forefront. Today, most organizations have a very lean or non-existent data governance framework, and part of the reason is that there's no tooling out there that supports the whole gamut. You'll see bits and pieces: even most clouds today will support a data catalog, and that's about as much as you'll get. If you look at third-party tools, you'll also see bits and pieces — someone just does MDM, someone just does metadata management — but nobody does everything. One of the closest tools I've seen is Informatica, but that's super expensive. Collibra is another company working towards an enterprise data governance tool set, but again, it's bits and pieces; no one does everything. The last aspect I want to mention is that part of the reason this is so difficult is the data silos within organizations. Most organizations have multiple groups; each has its own data sets and its own data dictionaries, which they don't share across the organization.

So I'll briefly touch on the concept of data mesh. This is basically about bringing together all these data silos in some way or form and combining them so they stop being silos. Data mesh is essentially an abstraction layer built on top of your data silos that supposedly makes everything visible to all of the users. Again, it's not completely settled; people do talk about it, but getting it to something that actually works is something you rarely see — let me put it that way.

So, the next piece: when I started at the Cal State University, part of the reason they brought me on is that they were just starting their cloud journey and they wanted someone more cloud focused.
Now, I inherited a team of engineers who were all SQL engineers; they had pretty much spent decades writing Oracle SQL, and for the most part that's great. But one of the key philosophies I've developed over the years is DataOps, which is built on the DevOps philosophy. DevOps has been around now for what, 15-plus years, but the key thing I liked about it is testing. And I saw that this was a good way to start building the team, to understand how it could work in an organization such as this. Now, the CSU is very legacy oriented, so all of the technology is legacy based; you'll see more Oracle and PeopleSoft there than anything you'd call modern data engineering. So the first thing we did was retraining: I reskilled and upskilled the group to start asking, how do we build something like this? We needed a platform that's scalable and can evolve in a very structured manner. So we started by building multiple environments. We now have a dev and a prod environment; we also have a staging environment, but that's more of a launch-when-you-need-it, shut-down-when-you-don't environment. The good part is that we now have environments that make it easier for us to write code.

The other thing I did was move us away from SQL-for-everything. We still use SQL today — we haven't gotten rid of SQL — but the fact of the matter is that SQL is not the best mechanism for writing test cases. If you have written SQL, if you have looked at SQL, you've probably seen complex SQL statements that are very hard to debug, especially when something happens in production. You know what happens: where do you debug? You debug and change the code in production. You've seen those memes, and I've done it before — when something fails in production, you always end up debugging in production. That's not something I like, and I've made the team move completely away from it. So how do we do that? First, you need to move people away from thinking everything needs to be SQL based. SQL has its place and position — I'm not saying get rid of everyone's SQL — but you need to know where to use it, and you need to know that very complicated SQL statements are not the best place to identify and debug problems. If you do write SQL, I always push my team to modularize their SQL statements, break them down into more functional, logical equivalents, so that we can add test cases against them.

That leads to the other part: what do we move from SQL to? Most data engineering pipelines today have moved to Spark, and that's the platform we use. If you ask what the data engineering tool of choice or platform of choice is, it's probably Spark, and in, I'd say, 70 to 80 percent of cases it will be PySpark, just because it's easier to learn, since it's a Python-based development environment. So we went with that. Other things we did: deployment pipelines, a repeatable and reliable process, automate everything. Today we follow a continuous delivery cycle, so we push to production once a week.
At some point we'll go to continuous deployment, but at this point we're focused on just delivery. And we version control everything, which allows us to roll back much faster than we could before. Also, nobody has access to production except the admins. That means if something breaks, developers have to apply hot fixes and roll them out to production through a proper release cycle. It's a very DevOps-friendly way of doing things, and it allows us to be more flexible and grow more rapidly. The other thing we pushed the team on is writing reusable functions and an extensible framework. Everything we have follows that pattern, so if anybody else comes on, they can just reuse functions that have already been written; we don't have to reinvent the wheel. That's one of the things I like about how we built this out.

So how do you implement this? The first thing I told my team is that we need to break this down, compartmentalize it into divisible portions, and then work on the individual components. The first one is the ingress part. We need a component or framework that can ingest data from anything. Today, most of the data is batch, but we've also written code that can ingest streaming data — we have some streaming use cases built into the framework — and it's extensible to handle most kinds of third-party platforms. We have tools such as MuleSoft and Boomi that can plug in and push data to us, and we can easily use those. Or we can use DMS, the AWS Database Migration Service; we can use flat files; we can use FTP servers. All of that has individual components that we've already built, which can be extended and used by anybody who wants to land data on the platform. That makes things easier for third-party developers.

The next piece is that we standardized our file format. If you land in our S3 bucket, the first thing we do is convert the files to Parquet. Parquet is a file format used very heavily in the data processing world today: because it's a columnar file format, it allows for more performant processing, and it gives more structure to the data, allowing you to understand, extract, or crawl the data much more easily and apply that to your data catalog. So we have that kind of framework built out, and it allows us to scale. The second piece is that, once you land in S3, you follow our framework for processing. Today, all of our applications are on AWS. The reason we use AWS so heavily is that we have a really good partnership with AWS, and we can use any AWS tool without having to worry about licensing. Being a public university, every time we want to use a third-party tool or application there's a huge legal wrangle we need to go through, and that makes life harder for us. So to launch and test things quickly, it's much easier for us to start with AWS as the tool set than to work with something third party. That's not to say we can't do open source, but my team is so small.
So I want to focus the team on development and building things out, not on spending too much time on maintenance and administration. You deal with the hand you've been dealt: focus on the opportunities you have and show measurable improvement, because that's where you bring value to the team. Prove that this can scale out, and once you've built trust and confidence, you can hire more people and take on the other things you want to do. You have to balance what needs to be handled against where you can show success; that's really key. So how does it work? On the ingress side, as I mentioned, we handle heterogeneous data sources: databases, flat files, streaming data. We bring them in through those interfaces, process the data, and store it in S3. First, the data has to land in S3. Second, it has to be converted to Parquet. Ideally we'd like you to drop your data in Parquet format by default, but if you don't, we have a process for that: a Lambda function kicks off automatically, checks whether the file is already Parquet, and if it isn't, it crawls the file, figures out what the elements are, and converts it to Parquet. We've automated that piece as well, which gives us more flexibility. The nice part is that when we crawl a data set, we store the result in the data catalog. The data catalog is important because it's what we expose to every other tool we have; it's our single source of truth. Anything downstream that wants to process the data goes to AWS Glue, which is our data catalog provider, and pulls the schema information. It also means that if something changes in a data set, we apply the change in the data catalog, and any downstream process can pick up the new schema and make the appropriate changes to its data sets. I'll talk a little more later about what automated schema changes mean for us. Before we go on, any questions? Okay. So how do we actually apply this on the processing side? You start with the framework and get the code base. Today we're using CodeCommit, which is essentially Git wrapped up for AWS; again, we use AWS because we try to make our lives easier and not complicate things, given that we have a small team. The framework follows a branching methodology. All the developers have their own branches, and they push into the develop branch. When they merge, they make sure there are no conflicts; if there are, they resolve them. We have regression tests in place, so all of the code has to go through the regression suite to make sure it doesn't break anything else. If it breaks something, they fix it. The rule is that the last person who checks in is responsible for fixing any breaks their change causes, and the code doesn't get merged into develop until that's done. The key pieces we do here are unit testing and integration testing.
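Going back to that automatic Parquet check for a second, here is a rough sketch of what such a Lambda handler could look like. The bucket layout, the Glue job name, and the S3-event wiring are all assumptions for illustration, not the actual setup.

```python
# Rough sketch of a Lambda that checks whether a newly landed file is
# already Parquet and, if not, hands it off for conversion. The Glue job
# name, bucket layout, and event wiring are assumptions for illustration.
import urllib.parse
import boto3

glue = boto3.client("glue")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    if key.endswith(".parquet"):
        return {"status": "already-parquet", "key": key}

    # Not Parquet: kick off a (hypothetical) Glue job that crawls the file,
    # infers its columns, and rewrites it as Parquet in the curated bucket.
    glue.start_job_run(
        JobName="convert-to-parquet",          # assumed job name
        Arguments={"--source_bucket": bucket, "--source_key": key},
    )
    return {"status": "conversion-started", "key": key}
```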
The other piece is that all of the code has to have some kind of testing in place. We have a minimum coverage threshold today of 30 percent, and we want to move toward 90 percent over time, but if your code coverage doesn't meet the threshold, it doesn't get pushed into the develop branch. We want developers to increase their tests over time. Today we mostly cover the happy path. There's TDD and there's BDD; we lean toward BDD, because testing everything is really hard, so at minimum you cover the most common paths and know the main flows are exercised. We've also added test cases for edge cases we've observed while processing data in production. A couple of other things happen on a pull request: another developer has to verify that the code looks good before the PR is approved. Once the PR is approved, we merge into the develop branch, and from there it eventually goes to our master branch. We have some containers in place but haven't fully formalized the container approach yet; today we're just using CloudFormation, which is roughly a Terraform equivalent, to push into our production environment. Part of the roadmap is getting Kubernetes containers in place so we can scale this out to other environments. That matters because today we're on AWS EMR, which is Amazon's data processing framework, but we've been talking to Databricks, one of the big data processing companies out there, and we might switch; we're still in discussions. It means our container story has to be solid before we make that move. So how does the test automation work? A developer creates a pull request, which notifies the approver. The PR is checked: is it going to be approved or not? The unit tests and code coverage run, and if it's not approved, it goes back through the cycle and the developer has to fix the code. Sorry about that. Then, for orchestration, today we're using Step Functions. There's also Airflow; if you've heard of Airflow, it has become the default scheduler in most organizations. We're using Step Functions today, and our plan is to migrate to Airflow; AWS supports managed Airflow, and we plan to take advantage of that in the near future. One reason we haven't switched yet is that Airflow still doesn't integrate all that well with EMR. If we switch to Databricks, we may have a better opportunity to use Airflow, although Databricks had their summit a couple of weeks ago and announced their own workflow scheduler. That's the other problem with all of these toolsets: everyone brings their own tools into the picture. So you have to think twice: do I use what's common out there, or do I move to the toolset offered by one of these companies? Databricks claims their workflow scheduler has better integration, but the caveat is that they want to tie you into their own framework.
So you have to think twice: do you want to stick with something proprietary, or move to something open? Databricks always claims it's all open source, and you can use it, but if you look at the codebase, roughly 80 percent of it is developed by Databricks. So ask yourself: is this really open source, or is it just Databricks maintaining the application codebase? Food for thought. So, test automation: how do we test? Today our testing is done with pytest and chispa. Pytest is the standard testing framework for Python, but when it comes to testing PySpark, which is what we use, there's no great framework out there. We found an open source library called chispa that does a pretty decent job of testing PySpark code, so we use it in combination with pytest. Test coverage is collected and calculated automatically, and because we've set a threshold, anything below it doesn't get pushed. On the other side of things is the egress workflow. Question? Go ahead. Yes, that's correct. We have prod and dev environments. Our dev environment is a cleaned, encrypted, redacted subset of the data in prod. Any level-one PII in the data sets gets scrubbed. We also have a data retention policy that's smaller than production: prod has everything for all time, but dev holds roughly seven to ninety days, depending on the data set. And we redact or encrypt any level-one PII for testing and development. That gives us a realistic data set to test the code base against. Good question, anything else? Then there's the question of how data gets pushed or extracted to our BI tools. Today we support two: one is AWS QuickSight, which is a Tableau clone, and a fairly poor one, to put it that way; and we support Tableau as well. One reason we didn't go Tableau-first is that the Tableau licensing was a nightmare. Tableau took much longer to get back to us to confirm they wanted to work with us, and we had a delivery milestone to meet, so we just went with QuickSight. Six months later Tableau agreed to our licensing terms, but by then we were already on QuickSight. A few other groups use Tableau, and since we're the main data processing group we need to support them, so we extended support to Tableau users as well. We don't officially maintain Tableau, so they still need their own Tableau licensing and servers on their end, but they can connect into our S3 buckets and our Postgres servers to extract the data. Which reminds me: we have Postgres today, Aurora Postgres, but we've been talking to Snowflake and will probably migrate from Postgres to Snowflake in the not-so-distant future. And from all of this we have a Glue catalog, where we curate all of our schemas.
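Here is a tiny sketch of what a pytest-plus-chispa test for one of those small PySpark functions could look like, reusing the hypothetical add_full_name function from the earlier sketch.

```python
# Sketch of a pytest + chispa unit test for a small PySpark transformation.
# add_full_name is the hypothetical function from the earlier example,
# redefined here so the test file is self-contained.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from chispa.dataframe_comparer import assert_df_equality

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def add_full_name(df):
    return df.withColumn(
        "full_name", F.concat_ws(" ", F.col("first_name"), F.col("last_name"))
    )

def test_add_full_name(spark):
    source = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    expected = spark.createDataFrame(
        [("Ada", "Lovelace", "Ada Lovelace")],
        ["first_name", "last_name", "full_name"],
    )
    assert_df_equality(add_full_name(source), expected)
```

Coverage can then be gated in CI with pytest-cov's fail-under option, for example `pytest --cov --cov-fail-under=30`, assuming that plugin is what enforces the threshold mentioned above.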
That Glue catalog is also what lets QuickSight extract all of the schemas, and our Postgres tables can load their schema information from the Glue data catalog as well, so it's one source of truth for all of our third-party applications. The third piece we use is Athena, which is a Presto clone: it lets us write Presto SQL directly on top of S3 without having to load the data into a database or data warehouse. We use that mainly for data sets that we don't load into Postgres. Our Postgres data sets are mainly smaller, transactional or use-case-focused data sets; we don't put all the data into Postgres because we don't want to overload it. We use Postgres mainly for performance: if a query needs more performance, we push that data into Postgres, and we'll follow a similar pattern with Snowflake. But if you're doing ad hoc querying, where you don't need high volume or high performance, the data can stay in S3 and you pull it from there with Athena. Athena connects to QuickSight, and even Tableau supports it, so you can pull it back into either. The response time isn't bad: in the past it could take a minute, but now it's seconds, maybe ten seconds sometimes, and it comes back quickly enough that you don't worry about it. So for ad hoc querying we point people at Athena instead of Postgres. Let me check my time; I have about 30 minutes remaining, so we can go through this. Now, for data to be analyzed and useful, it needs to be relatable: the pieces have to connect to each other. That's one of the key things. Even your data science and ML teams, when they talk about cleaning and scrubbing data, are really asking how to relate one data set to another: how do you make sense of the data you have, and how do you understand the relationships between the pieces? Part of that is having schemas you can build those relationships against. Ontologies help too: you can build a metadata layer and connect metadata to metadata to understand what relationships exist between the tables, roughly the foreign key associations, viewed from a metadata and ontology perspective. Until you meet those conditions, the data doesn't really deliver value, and this is something I took from the books by Bill Inmon. If you're familiar with him, he's called the father of the data warehouse; he and Ralph Kimball, the star schema guy, are considered the godfathers of the data warehouse. Bill Inmon said the data lake turns into a swamp, and the swamp starts to smell after a while, which is true: if you don't build something you can actually extract value from, it's just useless. So how do you go from a data lake to a data lakehouse? For the most part we have the analytical infrastructure in place, with QuickSight or Tableau sourcing from various data sources, the ability to stage and report, and the ability to store structured, semi-structured, or unstructured data. But you need one more key component to enable a true data lakehouse.
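Before getting to that missing component, here is a quick illustration of the ad hoc Athena path: a Presto-style query over the S3 data using the schema registered in the Glue catalog. The database, table, and result bucket names are hypothetical.

```python
# Sketch: run an ad hoc Presto-style query over S3 through Athena.
# Database, table, and result bucket names are hypothetical.
import time
import boto3

athena = boto3.client("athena")

query = """
    SELECT major, COUNT(*) AS enrolled
    FROM student_enrollments          -- table defined in the Glue catalog
    WHERE term = '2022FA'
    GROUP BY major
    ORDER BY enrolled DESC
"""

run = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "curated_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/adhoc/"},
)

# Poll until the query finishes, then print the result rows.
query_id = run["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```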
And that key component is enterprise data governance. I touched on this framework of data governance earlier, and this is roughly the standard picture of what it means. Multiple things go into data governance: data quality management, data architecture, data development, operations, security, metadata management, master data management, and document and content management. Take that whole ecosystem of components together and that's what builds your data governance. But if you think about it, these components are not easy to build. It's not like you install one application and it magically pulls all of this into a framework you can use. You have to spend a lot of time, and you have to have a team that can work on each of these components and build them out. You've probably heard the term data steward, someone who is supposed to provide direction and bring people together to build these components up, but even the data steward finds it hard to build each of these pieces. One of the biggest reasons is the data silos that exist within organizations. Trying to get one side of the organization to share its data is like pulling teeth: you have to convince them to actually work with you. So, the challenges with data governance: poor data quality costs real money, process efficiency takes a hit, and the benefits of new systems are never realized, all because of weak data governance. Think about it this way: say you build one of the best data catalogs out there. Whoever uses it will get a lot of value from it. The problem is that your data catalog only stores information about your systems; it has nothing about the systems in other parts of the organization. So when you start a new project or use case, because you haven't looked into what those other systems store, you may be reinventing the wheel: you start building a schema that already exists in some other part of the organization, but because you have no access or insight into it, you rebuild the whole thing in your own data set. You haven't understood the nuances of that schema, or why certain changes were applied on the other side of the company, and you go through the whole cycle of rediscovering something a different team has already built. That's a key reason many of these efforts fail: no insight into the silos in other parts of the organization. The objectives of data governance are to inform data management decision making, to ensure information is consistently defined and well understood, and to increase the use of, and trust in, data as an organizational asset. Now, this requires a lot of support and buy-in from your higher-ups; it's not something engineers can do by themselves. Building that trust has to be a top-down effort. If you try a bottom-up approach, it will never work, because you're asking other silos in the organization to work with you, and for the most part they will refuse, because it's not in their benefit. Think about it from a job security standpoint.
At the end of the day, if a silo just keeps doing its job and doesn't share its information, the people in it feel they're less likely to get laid off when something does happen. So unless management mandates that all of the groups work together, you're not going to see it. The flip side is that management looks at the bottom line: does it make money? Most of this is back-end work. You won't see dollars or any visible value out of it until you've spent time, money, and effort, and it may take a few years to see that value. So you'll also get pushback from management saying this is a waste of time, why are you even doing it? That's another reason these efforts fail: the time you have to invest versus the near-term value that comes out of it doesn't look good, and you need someone in management who understands the long-term value. If you don't have that, it's hard to build support for these kinds of use cases. The other objectives are regulatory compliance and eliminating data risk. Before we go on, any questions so far? So how does this work in practice? This is a system I've been working on for several years; I've formulated it as a framework for how data governance can work. At the lowest level you have your application databases and object storage, with source and architecture services on top of that. Starting from there, the base of the framework is data security: compliance, anything regulatory, CCPA, GDPR, whatever applies, is handled at that base layer of your data framework. Then there are things that have to work across the board, like auditing and workflow, so that whatever happens, you keep track of it: who touches your data, when it was touched, and whether the people touching the data have the right permissions. All of that needs to be audited and stored somewhere so that in the future you can verify that the right access is being granted to the right person at the right time. On top of that you have the data catalog, which I've talked about: the single source of truth for all schema information. It lets downstream processes pull the schemas and analyze and adjust their data sets appropriately as data moves through the pipeline, and it lets you make changes on the fly and notify downstream processes of those changes. One thing we've done with our data catalog around schema changes, which a lot of newer systems handle, though we handle it a little differently, is this: additions, when new columns are added, don't stop the process. We keep the pipeline flowing, but we notify our end users that there's a new column available if they want to use it. Downstream processes are not allowed to use select star; they have to use named columns. That's become a standard best practice: if you're writing SQL, select individual columns, because select star will break your queries downstream. But if columns are modified or deleted, we kill the process and notify downstream that a critical component has changed. This typically happens when your upstream providers don't notify you.
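Here is a rough sketch of that add-versus-modify-or-delete policy as a schema check against the Glue catalog. The database and table names are placeholders, and the notification is just a print statement for illustration.

```python
# Sketch: compare an incoming DataFrame's columns against the schema
# registered in the Glue catalog; warn on additions, fail on removals/changes.
import boto3

glue = boto3.client("glue")

def check_schema_drift(df, database: str, table: str) -> None:
    """Warn if new columns appear; raise if registered columns disappear or change type."""
    registered = glue.get_table(DatabaseName=database, Name=table)
    catalog_cols = {c["Name"]: c["Type"]
                    for c in registered["Table"]["StorageDescriptor"]["Columns"]}
    incoming_cols = {f.name: f.dataType.simpleString() for f in df.schema.fields}

    added = set(incoming_cols) - set(catalog_cols)
    missing = set(catalog_cols) - set(incoming_cols)
    changed = {c for c in set(catalog_cols) & set(incoming_cols)
               if catalog_cols[c] != incoming_cols[c]}

    if added:
        # Additions keep flowing; downstream users are told a new column is available.
        print(f"NOTICE: new columns available: {sorted(added)}")
    if missing or changed:
        # Deletions or type changes are treated as critical: stop the pipeline.
        raise RuntimeError(
            f"Schema break: missing={sorted(missing)}, changed={sorted(changed)}"
        )
```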
Our upstream providers are sometimes third parties, and even within the organization, people in other departments won't tell you when they change a schema. That's one of the key things we've observed. They don't tell you a change happened, something breaks, and only when you go back and ask, hey, why is this broken, why did you change this without any notification, do you hear, oh, we forgot to tell you. Those are things you have to deal with, but because the system is built to handle failure, we aim for what's called graceful degradation. If you fail, you need to fail gracefully, not abruptly; you should be able to handle the situation in a way that doesn't create too many failure points. We know we're bound to fail; we just handle it much better than we used to. Let me talk about data lineage. This is something we're still working on, but data lineage is basically the story of how the data arrived in its final place: from the source it came in from, through whatever transformations it went through, to where it landed. By keeping a lineage tracker, you have a complete picture of where the data originated, which elements were used, and what business logic was applied to produce the final transformation you wanted. On top of that sit metadata management and master data management. Master data management has been around for many years, and it's important because it gives you a finalized view of what your data should look like, especially when there's conflicting data across the organization. It creates a clean data set you can work against: master data management defines business rules that streamline conflicting data and level-set it across the organization. Any questions? Yeah, good question. A couple of things here. There are third-party tools that are supposed to help with this; if you've heard of TAMR, T-A-M-R, that's one tool people talk about for data cleansing. But we don't have the luxury of a third-party tool, so most of it is written in code. We have business rules in place, and we understand how a lot of our data should look. Take SSNs, for example: we can flag an SSN that doesn't come in a form we recognize. Or the data may have shifted because extra spaces crept into the data set, or there's some other kind of garbling. You have to have business rules in place to handle those kinds of data quality challenges. So we have rules that read the data set, checks that look at the data and convert it. If something is broken, say an extra newline pushes part of a record onto the next line and splits it up, we can merge it back into one line. We look for multiple issues like that and use business logic to correct them. I wouldn't recommend this for everybody unless you understand your data set really well; if you don't, it's very hard.
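As a small illustration of those hand-written business rules, here is a hypothetical PySpark check for malformed SSNs and stray whitespace. The column names and the exact rules are made up; they are not the production logic.

```python
# Sketch: simple hand-written data quality rules in PySpark.
# Flags malformed SSNs and trims stray whitespace; names and rules are illustrative.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

SSN_PATTERN = r"^\d{3}-\d{2}-\d{4}$"   # expected form: 123-45-6789

def apply_quality_rules(df: DataFrame) -> DataFrame:
    cleaned = df.withColumn("ssn", F.trim(F.col("ssn")))   # strip extra spaces
    return cleaned.withColumn(
        "ssn_valid",
        F.col("ssn").rlike(SSN_PATTERN),                    # mark rows that fail the rule
    )

def split_valid_invalid(df: DataFrame):
    """Route clean rows onward and quarantine the rest for review."""
    flagged = apply_quality_rules(df)
    return flagged.filter("ssn_valid"), flagged.filter("NOT ssn_valid")
```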
Now, with third-party data, when we see these issues, it's harder for us to identify them, because the data sets come from upstream and sometimes we don't fully understand the use case; our downstream users are more attuned to how they want to use a data set than we are. So it becomes a challenge in those cases. For the data sets we do understand, we're able to apply better rule sets and quality control. We haven't solved it completely, and at some point we may look at solving it with third-party tools. But then again, it's a cost question: do you want to buy a third-party tool and spend the money, or do you want to spend the time and effort building the rules into your own system? It's a cost analysis of which makes more sense, building the business logic yourself or having a third-party tool do it. Make sense? Next, the importance of the data catalog. This is a slide I picked up years ago; it gives you an idea of who uses the data catalog and at which layer of the data. For reports, metrics, and definitions, the audience is pretty much your business users; most business users are only interested in the reporting dashboard and in the metrics and definitions behind those reports, not the remaining layers. Your business analysts want the reports, metrics, and definitions, but on top of that they want data sets, queries, models, and views. Report developers and analysts go one step lower, to the refined data. Data scientists want to get to the atomic data. And finally, your ETL developers focus on the sources themselves: the data sources and systems, where the data comes from, and the transformations. It gives you a picture of which layer of the data each part of the organization focuses on, and I like it because it tells the story of who wants to use what, and at which stage of the data's journey into the final system. Then, conceptual architecture. This is my conceptual year-3000 thing: if I can ever get all of this to work, it will be awesome, but it's still in the making. At the lowest level is a services platform, ideally microservices that can automatically source data from multiple places, which we partially have today. On top of that sit the security and compliance framework, AI and catalog services, reporting services, and orchestration; then auditing and workflow, data security, privacy, and compliance, the data catalog, and all of the data governance tools; then microservices, the API gateway, and authentication, authorization, and accounting, which is your single sign-on plus capturing permissions for users at a high level, so there's one source of truth for permissioning. And finally, the systems the data comes from: on-premise, AWS, partners, consumer intelligence. Sorry, go ahead. So data lineage, as I mentioned, is keeping track of the data from its source, through the transformations, to when it actually shows up in the report.
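To make that a little more concrete, here is a hypothetical shape for a lineage record that gets appended at each transformation step. The fields and step names are purely illustrative, not the system being built.

```python
# Sketch: a minimal lineage record appended at every transformation step,
# so a report field can be traced back to its source. Fields are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class LineageStep:
    source: str            # where the data came from, e.g. "s3://landing/students.csv"
    transformation: str    # what was done, e.g. "add_full_name"
    business_rule: str     # why, e.g. "concatenate first and last name"
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class LineageTrail:
    target_column: str                             # the element that shows up in the report
    steps: List[LineageStep] = field(default_factory=list)

    def record(self, source: str, transformation: str, business_rule: str) -> None:
        self.steps.append(LineageStep(source, transformation, business_rule))

# Usage: each pipeline stage records what it did to the column.
trail = LineageTrail("full_name")
trail.record("s3://landing/students.csv", "add_full_name",
             "concatenate first and last name")
```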
For a specific element, what you're trying to understand is how that element landed where it did: what the source of the data was, what transformations it went through, and what business logic was applied to it. Data lineage tells you that whole story, so when you look at the report, you actually know where the number came from and how it was transformed. As I mentioned, for data governance to happen you need a partnership with the business. A top-down approach to data governance is what makes things happen; if you don't have a sponsor, it's going to be much harder to build this out. Data management is a shared responsibility between the data management professionals within IT and the business data owners who represent the interests of the data. You should be able to convince your business stakeholders that if they don't do this, the value of their data is diminished. Some business owners only care about specific reports and will say they don't care about any of this, but others will ask, how can I get more value from the data? In those cases you can make the argument: if you sponsor this, you'll actually see more value. If you have data scientists, they'll be on your side, because they're the ones who want to extract value from the data, and the only way to do that is to apply this logic and these business transformations to the source data, so we can supply structured, well-informed decisions to our business stakeholders. So it's shared accountability between business and IT. Business data owners are the data subject matter experts; they represent the data interests of the business and take responsibility for the quality and use of the data. I'm coming to the end, almost done. One of the things I want to mention, and I don't know how familiar you are with this, is that Matt Turck, a venture capitalist, publishes something every year called the MAD landscape: the machine learning, AI, and data landscape. The last one came out last October, and he'll probably do another this October; you can search for Matt Turck MAD landscape. It maps all of the application tools in the data ecosystem; here we're just talking machine learning, AI, and data. I took a screenshot of it, and the number of tools is humongous. It's hard to even read what's in there, but if you search for it you'll see the whole thing: it's probably a thousand applications and tools used in this space. You don't even know which ones you should use for your use case, and a lot of them overlap: one tool does bits and pieces of what you want, another does other bits and pieces of what you want. So it's a good way to understand the complexity and the sheer number of tools out there, covering everything from data processing, data transformation, and data hosting to data warehousing and databases. There are a lot of them. Now, I just picked out the data piece; the one I showed you before this was AI, machine learning, and data.
From that, I took a screenshot of just the data portion. I wanted to highlight that even within the data piece alone there are probably hundreds of applications: storage, Hadoop, data lakes, data warehouses, streaming, RDBMSs, NoSQL databases, NewSQL databases, real-time systems, graph databases. Just from that slice you can see it's mind-boggling how much is out there. Most of the time, in your day-to-day job, you don't have to worry about all of this, but it's good to at least know what exists, so that when you do a tool evaluation, when you're trying to see what could make your life easier, you understand that maybe this is something you should look at. It also speaks to what the future holds for the data world. Understanding where data is moving and how it's being handled is important. The way I see it, data is always evolving; you can't be stuck with what you have today, because tomorrow things change rapidly, and if you don't know what's out there, you'll fall behind the curve. Keeping an open mind and understanding what's available will always help you in the future. Here's another one that came out about a month ago, from LakeFS, focused mainly on data engineering. It's another map of what's out there, this time from a data engineering perspective: ingestion tools, object storage, metastores, open table formats, compute and analytics engines, orchestration, MLOps, data science and ML, observability. So there's a lot out there even just from a data engineering point of view. I talked about Databricks, which has pretty much become the standard for data processing. Snowflake is what people talk about as the data warehouse replacement; a lot of people talk about Snowflake today. Redshift is also in data warehousing, but it's lost some of its luster because of Snowflake; Snowflake took all the shine from Redshift. Redshift is now trying to compete with Snowflake, releasing a lot of features to reach feature parity and get back into the game. But from a data warehouse perspective, it's Snowflake, and from a data processing framework perspective, it's Databricks. The other thing to note is that Databricks and Snowflake are now trying to get into each other's business. They've become such huge companies that the only way they can keep growing is to compete with each other, so now you see Databricks saying they have a data warehouse and Snowflake saying they have a data processing engine. It's funny that everyone is trying to compete by expanding into everything, and obviously all the clouds have their own home-grown processing stacks that they promote. Everyone is trying to get a piece of everyone else's pie. I've put references up here if you want them. And then I want to mention Data Con LA. I host this conference once a year, and it's coming up in two weeks: Data Con LA, August 13th at USC, a full day of conference talks. If you want more information, go to the Data Con LA site. I have a complimentary code for you here; it's valid for 20 people and it expires tomorrow, so feel free to use it and attend.
It's at USC in downtown LA. So that's all I have. Okay, yes, that's the code on the screen; grab it. That's all I had. Any questions? Hopefully this was informative and helps you out. Go ahead. How are we supporting streaming? We just started doing that. Today we're using managed Kafka, and the plan is to expand the use cases over time. Part of what we're doing is with student data: we're getting student data from the campuses, and this is a very small subset use case of that data. We're trialing it now, but the plan is to expand to more use cases as we progress. We want to bring in more student data and use it for student success, part of which is understanding how we can help students succeed in and graduate from their classes. We're bringing those use cases in now. Currently it's still mostly traditional object storage: all of the data still arrives in the traditional once-a-day load. But because we've committed to building a more streaming-oriented architecture, we started with a small use case. We've told the campuses to start sending the data more frequently, so now it's a 15-minute push. We keep our Kafka APIs open, they push into them every 15 minutes, a Kafka consumer reads off the queue every 15 minutes, and the data is loaded through Firehose into the S3 file system. We have a Lambda application that picks up the data as it lands in the S3 buckets and processes it on a 15-minute pipeline: every 15 minutes we process the data and load it into our reporting dashboard. Then every day we rebase the data, because we can't just keep piling the cumulative increments on top of the data forever while trying to do near-real-time analytics. So once a day, at the end of the day, we redo the full processing of the data, and the new 15-minute increments then apply on top of that fresh base. That's why we call it near real-time rather than true real-time: it doesn't capture the true, reconciled value of the data until it's merged with the whole previous data set, and that happens once a day. Does that make sense? Yeah, that reconciliation happens only at the end of the day; during the day we just layer the increments on. So sometimes you'll see a few more records in the real-time view than in the reconciled data. We've set that expectation with the users who see it, and at the end of the day it all gets cleaned up. Any other questions? No, it's not across all campuses. The streaming use case right now is just one of the campuses; we're testing a small portion of it, and once we show success that this can work, we'll expand it to all campuses. One of the other things to understand is that every campus is independent. Even though we announce something to the campuses, it's up to each campus whether they want to join us or not. The other part is that every campus has its own architecture and technology, so some of them are reluctant to join us, because they say, oh, we have our own way of processing the data.
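To make that 15-minute pipeline a little more concrete, here is a rough sketch of the micro-batch step: take a newly landed Firehose batch from S3 and append it to the day's incremental view that feeds the dashboard. Every bucket, prefix, and job name is hypothetical, and whether this logic runs inside the Lambda itself or on an EMR step the Lambda triggers is beside the point of the sketch.

```python
# Sketch of the 15-minute micro-batch step: when Firehose drops a new batch
# file in S3, append it to today's incremental view that feeds the dashboard.
# Bucket names, prefixes, and the separate daily-rebase job are hypothetical.
import urllib.parse
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch").getOrCreate()

INCREMENTAL_PATH = "s3://reporting-bucket/incremental/{day}/"

def process_micro_batch(s3_event: dict) -> None:
    record = s3_event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    batch = spark.read.json(f"s3://{bucket}/{key}")          # one 15-minute Firehose drop
    (batch.write
          .mode("append")                                     # stack onto today's view
          .parquet(INCREMENTAL_PATH.format(day=date.today().isoformat())))
    # A separate end-of-day job rebuilds (rebases) the full data set, and the
    # next day's 15-minute increments then apply on top of that fresh base.
```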
We do enforce this for all of our batch data; the batch data we get no matter what. But the streaming data is a small use case, and we started with one campus; we don't know yet whether other campuses will join. We still get their batch data on a daily basis, but we don't know whether they'll want to join the streaming portion. And yes, once the data comes to us, we process it for all the campuses, but every campus is on its own: even though they give us the data, they may do their own processing on their end. I know Northridge, for example, is a big Oracle shop and still relies on Oracle quite a bit. They're looking to move to the cloud, but from what I understand they're essentially hosting Oracle in the cloud; they're not following the data engineering practices we've put in place. We've offered to share our code and pipeline, but a lot of the legacy mentality is still there at most campuses and they want to stick with it. Plus they all have their own vendors they want to stay with, so they don't necessarily want to follow the practices we've set. What we do is push the boundaries of what's been done: not many people in higher education are doing what we're doing today, so we've been pushing the boundary of where higher education should be moving. From that perspective, it's been harder to convince other campuses to work with us, even though we've offered to share our repositories and code base, in the two and a half years since I joined. Any other questions? Thank you, everyone. Very cool. I thought it was you, I recognized the name. Right. Yes, yes, definitely.