All right. Hi, everyone. Welcome to my talk on unveiling the magic of machine learning pipelines with Apache Hudi. I'm Nadine. I'm leading developer initiatives at Onehouse, and I've previously been at Rockset and Bose. I'm passionate about bridging engineering, product, and marketing to help drive developer adoption. You can find me on LinkedIn at /in/nadinefarah, or you can also find me on Twitter at @nfarah86. If you're enjoying today's talk, I would love to hear from you. You can tag and follow the Apache Hudi LinkedIn channel at /company/apache-hudi, or you can follow us on Twitter at @apachehudi. The QR code is a link to the Hudi community, so if you want to follow up async, you can find me there. I'll also hang out outside the conference room to help answer some of your questions. For the agenda today, we'll go over the medallion architecture, and that will lead us into a Hudi overview. From there, I'll talk about the incremental processing framework, and then we'll go ahead and do a case study. So let's go over the medallion architecture. When we think of machine learning pipelines, it looks something like this at a high level: you ingest your data, then you have some sort of data management, you process the data, then you create your models, and then you deploy them. In this talk, I'm going to focus more or less on ingesting, managing, and processing data. More often than not, when companies employ the medallion architecture at some level, they use it to help refine and process the data. So let's take a look at what the medallion architecture looks like. With a show of hands, how many of you have heard about the medallion architecture? No hands? How many? Okay, cool.
So this is a typical view you may have seen in content around the medallion architecture, but let's do a quick walkthrough. When data is ingested, it will be unprocessed and stored in the data lake. Typically, in the raw or bronze layer, you'll have duplicate data, changelogs, raw event data, and more, and the data here is typically unstructured. From there, the data will graduate to the silver layer, where you'll perform data deduplication, validate the data, and orchestrate and manage it: for example, you'll do data cleaning, you might do file sizing, and more. And then finally, you'll write some join queries to join all these different silver tables in order to create a fact table, also known as a gold table, that can be used by downstream applications like AI/ML applications, analytics, and more. If you look at this architecture, it seems pretty easy to do, right? But what does it take to really implement something like this? Generally, people approach the medallion architecture with the sample diagram I've shown, and you can see it's a little more complex when you're trying to implement such a structure. In the raw layer, you'll first ingest raw or unprocessed data into the data lake and create the raw or bronze layer (I'll use "raw" and "bronze" interchangeably throughout this presentation). Then, from there, you'll do a full table scan to grab all the data, including new updates, and rewrite the entire silver table with augmented data. In this process, you might use SQL or PySpark or something like that to deduplicate and manage the data; you might do cleaning or clustering. And augmenting the data in the ways I described is a very manual process: you have to manage and orchestrate all the processes to ensure there are no concurrency or write conflicts, which can lead to data corruption, data loss, slow reads, and more.
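To make that silver-layer deduplication step concrete, here's a minimal sketch in plain Python of keep-latest-by-key deduplication. In practice you'd do this with SQL or PySpark over much larger data; the field names here ("id", "ts", "value") are purely illustrative.

```python
# Minimal sketch of silver-layer deduplication: keep only the latest
# version of each record by primary key, using an ordering field ("ts").
def deduplicate(records, key_field="id", ordering_field="ts"):
    """Keep the record with the highest ordering value per key."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[ordering_field] > latest[key][ordering_field]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"id": 1, "ts": 100, "value": "a"},
    {"id": 1, "ts": 200, "value": "b"},   # newer duplicate of id=1
    {"id": 2, "ts": 150, "value": "c"},
]
silver = deduplicate(raw)   # two records survive; id=1 keeps value "b"
```

The point of the sketch is that even this simple logic has to be re-run over the whole table when you do full rewrites, which is exactly the cost the talk is calling out.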
From there, you'll do a full table scan and join the silver tables. The join will happen in a temp table that you might create in Spark, and then you'll output the results into a Parquet file and create your gold layer, or fact table. The query engine you use will do another full table scan to execute the query and return the results, which can be used for analytics and applications. You can see that there's a general theme here of scanning the full table and doing full table rewrites. Also, this whole architecture is very manual: you have to manually size your files to avoid the small-file problem, you have to clean old data, and so much more. And the technologies available in the market kind of encourage this approach to building the medallion architecture because they lack a few key things. So let's see what that looks like. In the diagram, each of the zones has the same services repeated, but I'll go over what each of these services is and how it affects the medallion architecture. So let's talk about automated table services. Many technologies don't offer fully automated table services that can automatically help manage your data and maintain your table's health. For example, in Spark, you might have to run manual compaction, which merges smaller files into larger ones so you can improve query performance, and also clean old data to ensure faster analytics and compliance. But if you run these two services together, at some point you would have to implement your own optimistic concurrency control mechanism in Spark, where, if two services are trying to modify a record, one will have to fail or be blocked until the other service has completed its task. The next thing I want to talk about is the incremental framework and indexes.
So the incremental architecture prides itself on incrementally updating datasets without reprocessing the entire dataset over and over again. A key capability that helps achieve incremental processing is handling data mutations at the record level, which avoids reprocessing the non-changing data. Also, adding an indexing mechanism can further help to quickly process these record mutations, because indexing helps locate records in the data lake faster and more efficiently. Without this, you have to constantly do full table scans and table rewrites, and at terabyte, petabyte, and exabyte scale, this inefficiency becomes really prominent. You'd have to throw a lot of compute at your application, and at some point it's just not a viable solution. So what's an intuitive architecture that might be more efficient? Let's take a high-level look. In the sample architecture with Apache Hudi, you can ingest data into the raw zone, and from there, you can just do an incremental pull to pull only the new data and update the silver table. Here, Hudi automatically manages all the data cleaning, file sizing, and other table management services, and in this case it becomes a little less operational than what you may have to do with Spark. Now, to build a gold table, you'll still create your temp table where you can perform the join operation, but the changes are incrementally updated to the gold table, so you're not doing a full table scan or a full table rewrite. And then you can use a query engine to do an efficient lookup. But now that I'm introducing Apache Hudi, you're probably asking, well, what is that? This brings us into the overview of what Apache Hudi is. Previously, we looked at the bottlenecks of the medallion architecture. But what if we just flipped it and you actually had these features and services enabled? Apache Hudi is a data lakehouse platform that provides database-like features on top of your data lake.
And with Hudi, you have fully automated table services that continually schedule and orchestrate clustering, compaction, cleaning, file sizing, indexing, and so much more, in order to ensure your tables are always up and ready. In addition, you can replace old-school batch pipelines with this incremental framework on your data lake. Hudi allows for quick updates and deletes with a fast, pluggable indexing mechanism, and this includes support for streaming workloads, with full support for out-of-order data, bursty data, and data deduplication. Since Hudi provides an indexing mechanism, you can take the data updates from an upstream database and apply record-level changes to downstream applications. As a result, the incremental framework allows for faster ingestion and lower processing times for your analytical workloads. Hudi's features and services enable faster performance: we can go from hours or days to just minutes when updating downstream applications. So let's take a bird's-eye view of what this platform actually looks like. A lot of times, when you think of lakehouse technologies, you think they're just a table format. But Hudi is actually a fully comprehensive platform that's designed for data ingestion and processing, and it's equipped with a wide range of features and services aimed at maximizing the efficiency of both writing and reading data. At the foundational level, you have your data lake storage; this is where your data resides, in some format like Parquet or Avro. On top of this, you have the transactional database layer, and this is where Hudi's true power shines, as it offers multiple services, including table services, indexing, concurrency control mechanisms, and others. These services collectively offer significant enhancements to data and table management. Table services facilitate the handling of data at scale.
Indexing helps with faster retrieval operations, and concurrency control ensures the consistency of data across multiple operations. After the data has been successfully ingested and managed through the transactional layer, Hudi offers the ability to query the processed data with popular compute engines like Presto. By facilitating these integrations, Hudi allows for efficient and advanced querying of data. Now that we have this bird's-eye view of what Apache Hudi is, let's dive a little bit into what a Hudi table looks like. A Hudi table consists of file slices. Each file slice contains a base file, which is a Parquet file produced at some commit time, or instant time, along with a set of log files that contain inserts or updates to the base file since the base file was last produced. And as you can see in the top half of the diagram, a group of file slices is known as a file group. When writes come in, as shown in the top half of the diagram, the records are written to file slices, and each record has a key that is mapped to a particular file group. So let's talk about the advantages of having this particular file layout. On the write side, if you have multiple table services running in the background, the services don't block each other, because Hudi has multi-version concurrency control. And on the read side, the file layout allows the query engine to query the table at a particular point in time; when we talk about the incremental framework, this becomes a really important point. Now, on the bottom half, when records get written into a file group, Hudi's timeline records the commit action that was performed. Structurally, if you create a Hudi project, the timeline is located in the .hoodie folder, and it's essentially an event log. There are different actions that can be recorded to the timeline.
For example, a clustering event or a compaction event: basically, things that you do to the table get recorded into the timeline, and there are timestamps associated with every action, along with some metadata about it. Now, following the timeline, there's the metadata table. The metadata table is structurally different from the timeline in that it's an internal merge-on-read table. The metadata table is a central place for all the files' metadata, and when a commit happens, the metadata table gets updated as well. You can think of the metadata table as one big index. This brings us to a really good point: Hudi stores state. The timeline and the metadata table we just talked about are how Hudi stores state. So if a record has an update, Hudi checks the record's key to see if the record exists in a file group, and if it does, it updates the particular file slice where the record is located. Equally, the timeline gets updated as well. And since Hudi maintains a timeline of when an action or write occurs to a Hudi table, you can essentially find out what changes occurred in a given time range, and from there, you can update downstream applications or tables with just that data using Hudi's incremental framework. If you look at the slide here, you can see an incremental query with a time range from T-1 to T: you can specify the time range in which you want to see the updates and grab just those changes. Before we double-click into the incremental framework and the CDC feature for Hudi, let's see how Hudi is being used in the ecosystem. Hudi is proven at massive scale: Uber, Walmart, and GE all use Hudi for their mission-critical apps. In particular, ByteDance uses Hudi at exabyte scale for a single table, and even at that scale, Hudi is able to bring analytics from days down to minutes.
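The incremental query from the slide, grabbing just the changes between T-1 and T, comes down to a handful of read options on the Spark datasource. Here's a sketch of those options; the option keys are standard Hudi configs, while the instant-time values and the table path are placeholders for this example.

```python
# Illustrative read options for a Hudi incremental query: pull only the
# changes committed between two instants (the T-1 to T range on the slide).
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230901000000",  # T-1 (placeholder)
    "hoodie.datasource.read.end.instanttime": "20230902000000",    # T   (placeholder)
}
# In a Spark job you'd use these roughly like:
# spark.read.format("hudi").options(**incremental_opts).load("s3://bucket/table")
```

Leaving out the end instant time pulls everything after the begin instant, which is the common pattern for keeping a downstream table continuously up to date.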
One of the ways Hudi brings that time down is through its incremental processing framework. Recently, we introduced the CDC feature within the incremental processing framework, and this brings us into the next section, where we're going to talk about incremental processing. To recap, though: the medallion architecture represents a straightforward, more simplistic approach to constructing your bronze, silver, and gold tables, and the standout feature that spares you from conducting full table scans is Hudi's incremental framework. Using this framework, only the changes are streamed to downstream tables. To realize end-to-end incremental processing, Hudi provides Hudi Streamer to efficiently pull changes from the source, support mutable data and record-level changes, and conveniently write the data to downstream sinks, all the way from the source through the bronze and silver layers to the gold layer. Here's an example of how you can use Hudi Streamer to construct incremental processing end to end. A common use case is streaming the changelogs from a database like Postgres through Debezium and Kafka. Each message has the before and after images reflecting the changes, and the schema is registered in the schema registry. In the first step, the Hudi Streamer job gets the new data from the last checkpoint and bulk-inserts it into the bronze layer. The bronze table contains the exact raw events from the Kafka source for further processing. Next, another Hudi Streamer job is constructed to do any kind of data cleaning and augmentation. For example, users can transform their data by flattening fields, selecting relevant fields with projections, and applying any other custom transformations they want. Once that's done, the data is upserted to a silver table, which is a clean dataset.
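Going back to the first step of that pipeline, the bronze ingestion from Debezium and Kafka: a Hudi Streamer job is typically launched via spark-submit with arguments roughly like the ones sketched below. The class and flag names follow the Hudi utilities documentation as I understand it, but treat this as an illustrative sketch; the topic, table names, and paths are placeholders, not from the talk.

```python
# Illustrative Hudi Streamer (a.k.a. DeltaStreamer in older releases)
# arguments for pulling Postgres changelogs from Kafka, via Debezium,
# into a bronze table. Values like the base path and table name are
# hypothetical; check the Hudi utilities docs for your version.
streamer_args = [
    "--source-class", "org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource",
    "--schemaprovider-class", "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
    "--target-base-path", "s3://my-bucket/bronze/orders",   # placeholder path
    "--target-table", "orders_bronze",                      # placeholder table
    "--table-type", "COPY_ON_WRITE",
    "--op", "BULK_INSERT",                                  # raw events, no merging yet
]
```

The silver and gold steps would be additional Streamer jobs with `--op UPSERT` plus the transformations and SQL described above, checkpointing off the upstream Hudi table instead of Kafka.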
Once the new changes have landed in the silver table, the subsequent Hudi Streamer job conducts a more complex operation with business logic, using SQL provided by you, like joining the dimension tables and other data from multiple tables. After the complex business logic is applied to the changes, the records are upserted to a gold summary table for data analytics. One key piece of functionality here is supporting mutable data and incremental processing. So let's take a deeper look at how Hudi takes the changes, handles the mutations, and streams the changes downstream. When we look under the hood, there are quite a few steps between taking the incremental changes from the source and streaming them from Hudi to downstream tables. To enable mutable data at the record level, Hudi provides built-in support for locating records and for record payload merging, so that users can customize their insert, update, and delete logic. As I mentioned earlier, there needs to be consistency between the index and the data, so that the metadata can be used for reading and writing the table. Hudi provides automatic metadata management via the Hudi timeline and the metadata table. Besides managing the data and metadata, Hudi automatically optimizes the data layout on storage with small-file handling and table services like compaction and clustering, so that query engines can read well-sized files and improve query performance. Alongside the incremental processing, there could be concurrent writers, for example, backfill jobs to rewrite old data or jobs to delete selected data. So Hudi provides optimistic concurrency control and multi-version concurrency control for different use cases to efficiently handle multiple writers. Now, you may wonder how Hudi handles record-level mutation, which is necessary for incremental processing. Hudi provides a payload merge API for inserts, updates, and deletes so that users can customize what they need.
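On the write side, the record-locating and merge behavior just described surfaces as a few writer configs. The option keys below are standard Hudi Spark datasource configs; the table and field names ("accounts_silver", "uuid", "ts") are made up for this sketch.

```python
# Illustrative Hudi writer configs behind record-level upserts:
# the record key locates the record, the precombine (ordering) field
# decides which version wins when versions collide.
hudi_write_options = {
    "hoodie.table.name": "accounts_silver",               # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",    # primary key field
    "hoodie.datasource.write.precombine.field": "ts",     # ordering field
    "hoodie.datasource.write.operation": "upsert",        # insert-or-update by key
}
# In a Spark job, roughly:
# df.write.format("hudi").options(**hudi_write_options).mode("append").save(path)
```

These three knobs, the key field, the ordering field, and the operation, are exactly what the bank-account example that follows exercises.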
So let's walk through this example. Let's say you have a table that stores bank accounts. Each entry has a UUID, the name of the account, the last-updated timestamp, and the balance. Just like in other databases, Hudi requires a primary key field to be specified by the user to identify unique records, and the primary key field in this example is the UUID. For each incoming batch, Hudi looks at the primary key, the UUID, to identify whether an input record is an insert, update, or delete. In incoming batch one, we have one insert and one update. During the upsert operation, the table is updated by inserting the row with UUID 3 and updating the existing record for Ethan's account. For the next incoming batch, the first record is marked as a delete, and Hudi deletes that entry. The second record is another update for the XYZ account. Now, if you look at the results of the upsert operation, the balance for Ethan's account is not changed, and this is because we want to honor the balance with the latest timestamp. In this case, Hudi looks at the ordering field and makes sure to ignore the late-arriving data from the application's perspective, so the account won't get $20 from nowhere. Hudi has built-in support for event-time ordering, which is prevalent in streaming and incremental processing. Once the data mutation is done, Hudi attaches metadata such as the commit time and the file name to each record and also updates the metadata in the Hudi timeline, and these essentially serve as the state for streaming changes from a Hudi table. So while the primary key is required for record-level mutation, in some use cases, like event log ingestion, which is insert-only data, Hudi supports automatic primary key generation, so you don't actually have to specify primary keys; this is a feature that we released recently.
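The event-time-ordering behavior from that bank-account walkthrough can be simulated in a few lines of plain Python. This is just a sketch of the merge semantics, not Hudi's actual payload implementation: an incoming record wins only if its ordering field is at least as new as what's already in the table, and deletes remove the key. The names and numbers are illustrative.

```python
# Sketch of record-level merge with event-time ordering: late-arriving
# data (older "ts") is ignored, deletes remove the record by key.
def merge_batch(table, batch, key="uuid", ordering="ts"):
    for rec in batch:
        existing = table.get(rec[key])
        if rec.get("_deleted"):
            if existing is not None:
                del table[rec[key]]
        elif existing is None or rec[ordering] >= existing[ordering]:
            table[rec[key]] = rec
    return table

table = {"2": {"uuid": "2", "name": "Ethan", "ts": 100, "balance": 40}}
# A late-arriving update with an older timestamp is ignored, so the
# account's balance stays at 40 rather than reverting to 20:
merge_batch(table, [{"uuid": "2", "name": "Ethan", "ts": 90, "balance": 20}])
# A delete marker removes a record entirely:
merge_batch(table, [{"uuid": "2", "_deleted": True, "ts": 110}])
```

In Hudi, this comparison happens against the precombine (ordering) field through the payload merge API, which is also the hook where users plug in custom merge logic.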
So aside from the existing incremental pulls, Hudi provides a new CDC mode for incremental processing, which provides Debezium-like changelogs with before and after images, and here in the sample code, you can read the incremental data in the CDC format. For inserts, the before image is null and the after image is the new value. For updates, the before and after images show the values before and after the change. And for deletes, the before image is the record and the after image is null. With the CDC feature, you can use this data to further transform and process both the before and after values. So let's do a case walkthrough of what this looks like if you were to build an application. When we talk about what this machine learning pipeline looks like, I'm going to focus on ingesting the data and processing and refining the customer data; from there, once you have the data, you can deploy it. Here's a sample architecture that we'll walk through, and I'm going to take each portion and walk through it. In the first section, you'll have a person signing up for an account, making purchases, clicking on things, and updating their cart. From there, the data is either sent to a transactional database or a streaming source like Kafka, and then landed in a data lake, creating the raw layer. When data is inserted into the raw layer, it'll be indexed with whatever index you want to use. And then from here, we have the clickstream data. You can see the clickstream data has a couple of fields, like the session ID, the URL, and a description of what that is. You have the sample purchase schema, which includes the purchase ID, the quantity, the purchase price, and a description of what that is. You might also have a schema for cart activity.
You might need the customer ID, the product ID, the activity type (whether they added or removed items), the quantity they want to purchase, and much more. And then you have the customer schema, which is typically, you know, the customer's metadata. To create the silver layer, the data is incrementally pulled, and row-level updates occur: Hudi only applies updates to records if there's new data, and basically avoids the full table scan. From there, you can create a temp table and join the various silver tables to build a gold, or fact, table. Then, after you perform the join, Hudi will incrementally pull only the updates to the gold table and update the appropriate records without doing a full table scan. If you want to write a simple join query, this is what it looks like: you can get the customer's first name and last name, the clickstream URL, the timestamp, and product information, and here we're correlating the user's activities with purchases. You can perform a left join on the clickstream via the customer ID, and you can also do another left join on the purchases based on the customer ID. You can filter by a time range, and then you can order by the timestamp and purchase date. Once you get the results, you save them to a gold or fact table, and a query engine can be used to read the data; the results can be used to populate downstream applications. You can feed the data into a model and, from there, deploy it. So that wraps it up. I'm going to talk a little bit about Hudi's roadmap really quickly. We recently released 1.0.0-beta1, and this release has a couple of game-changing features. One thing that's pretty cool is that we have non-blocking concurrency control for streaming writes.
So multiple writers can operate on the table with non-blocking conflict resolution, and this can reduce the wait times or the bottlenecks that you might have. It's actually ideal for streaming writes, because transactions can proceed independently and concurrently, leading to increased throughput and overall system responsiveness, and this is really good if you have CDC workloads and much more. The other thing that's pretty cool is that Hudi now also supports functional indexes, built on top of Hudi's multi-modal indexing subsystem. A functional index is essentially an index on a function applied to a column, and it enhances access speeds and integrates partitioning into the indexing process. You can easily manage these indexes through SQL syntax, so you can create an index, if it doesn't exist, on some table using some column name, and provide options for it. If you want to learn more about the 1.0 beta release, you can visit hudi.apache.org, or when these slides are available, you can check out RFC-69, which has more details on this. You can come and build with the community, and if you want to learn more about how you can use Apache Hudi for your machine learning pipelines, there are a lot of resources here. You can follow us on LinkedIn and Twitter, and you can also scan the QR code to join us on Slack. We host weekly office hours; it would be great if you came through, and a PMC member and I are usually there to help answer questions. I think that's it for me. I want to thank you for attending my talk, and I'm happy to take some questions now. Thank you. Oh, cool. The compaction kicks in? Yeah. Long-running jobs. Okay. So this requires a lot more questioning. Are you on the Hudi Slack? Okay. And have you posted the questions in the general channel? Okay. I'm usually responding there. Maybe I missed your question, but one of the things I want to find out is the error message for where compaction is failing.
So I need the full stack trace to see what is going on. It could be a resource issue; I don't know if you have multiple services running together, and maybe there might be a conflict there. I'd need to look at the conflicts and see what is going on. So I would recommend: shoot me a message on Slack, tag me, and tell me that you were at the session, and I'll make sure to get to it sometime today. Because I need more: I need to know the Hudi version, and I need to know more about what you're doing, to debug the compaction failure. Yeah, let's follow up on Slack; I feel like I need to triage it more. But thank you for your question. Any other questions? Like machine learning in general? Or... Okay. How are you dealing with some of your machine learning pipelines? Or what are you using? Oh, Redis? Okay. Got it. Sounds good. All right. Thanks, guys.