Well, hello everybody, and welcome again to another OpenShift Commons briefing. I'm Diane Mueller, the director of community development for the OpenShift Origin project, and I run the OpenShift Commons. I have a special guest to kick off a series on big data with us: from Project Daton at Red Hat, Will Benton. He's going to give us an overview and introduction to big data and Apache Spark, a little bit in the context of OpenShift, but really just to lay the ground for further conversations. What we're planning on doing, and we'll talk a little bit about it at the end of this, is kicking off an OpenShift big data special interest group. So if you go to commons.openshift.org, under interests you'll be able to click through and join the big data SIG, and we will have a part two of this series and a number of other follow-on guest speakers in that SIG to kick off those conversations and find out what the different use cases are for big data in the context of OpenShift and containers and all things cloudy like that. So without further ado, I'd like to give Will Benton the floor to lead us through the beginnings of this conversation about big data.

Thanks, Diane. It's morning in my time zone, so if your time zone is close to mine, I'll wish you a good morning. If you're on the other side of the world, I'll wish you a good morning too, although it may be the part of the morning that you don't want to be awake for. So hello everyone, and thanks for being here. I'm Will Benton. I work at Red Hat on data processing, and today I'm going to introduce big data and Spark on OpenShift by talking about what it looks like to have a big data application in containers. Now, big data is sort of a marketing term, right? It's useful because we all know what it means, but you quickly get into the question of how much data is big data. And I think that's an interesting question, because a lot of real-world data processing workloads aren't actually all that big in an absolute sense, or compared to the kind of hardware we have to run them on. You can buy a machine with a terabyte of RAM pretty easily these days. So when we talk about big data, we're really talking about a convenient shorthand for a constellation of issues that a lot of applications have to deal with: doing interesting things with a potentially uncomfortable amount of unstructured or semi-structured data that's arriving quickly from diverse sources. When we're talking about big data, we're really talking more about a set of challenges than about a particular working set size. So for the rest of this talk I'm going to use the word data-driven, and we're going to explore what a contemporary data-driven application looks like in containers. I'll tell you a little bit about my background to start out. I joined Red Hat almost eight years ago and started working on cluster scheduling and cluster management products and contributing to open source distributed computing projects. For the last few years, I've been working primarily on big data and data processing, and for the last two years in particular, my focus has been leading a team of data scientists and engineers as we figure out how to use cool new open source frameworks and tools to solve real-world analysis problems at real-world scale.
So I'm coming to you today to share my perspective as someone who has developed these tools, applied them, and put them into production in a contemporary environment on OpenShift. Here's what we're going to talk about today. I'm going to start with some background about what the future of application development looks like from my perspective. We'll talk about what a data-driven application has to do and a few architectural options for organizing those responsibilities. And then we'll put everything together by showing some of the tools and frameworks you can use to actually implement these architectures on OpenShift. This first point, I hope, is not controversial. For many people on this call, or watching this talk later, microservices are not just the future, they're the present, right? I assume that many people have developed microservice-based applications and have container-based microservice applications in production now. But for those who aren't yet convinced, I'll level set with some brief definitions and talk about the advantages. If we structure our application as microservices, which I'm defining as primarily stateless, lightweight services that communicate via well-defined interfaces, then we get a bunch of benefits essentially for free. One is that these stateless services are easy to deploy and manage, because we can run them in stateless containers on container orchestration platforms like OpenShift. These services are commodities: if one of them goes away, we can spin another one up to replace it trivially. They're cattle, if you're using the pets-and-cattle metaphor. They're easy to scale out, because you can run them on a single machine or spread them across multiple machines, and if one service turns out to be a bottleneck, you can load balance behind a proxy, and so on. These services are also easier to test and debug than their stateful counterparts. Instead of managing complicated test fixtures that set up a bunch of state before each test to reproduce the sequence of events that caused a stateful service to break, you can simply ask a question of your stateless service, get back an answer, and know whether or not you have a bug. Those are all technical advantages, but there are also political and social advantages to this kind of application development, and I think those are the most interesting. It's much harder to organize a team of developers to work on a big monolithic application; it's harder to onboard new developers onto a big monolithic code base; and it's harder to diagnose performance or correctness problems in a black box. But if we divide an application up into simple communicating components that have explicit contracts for their behavior, it's much easier to scale out the team that works on it and for that team to be productive. So when I say that data-driven and data-intensive apps are the future, I'm cheating a little bit again. It's an easy way to make predictions about the future to say something that's already happening, and for a lot of people these things are the present. But increasingly, it's not enough to have a great application that attracts passionate users. You need to take the data your users provide and the data they generate and use it to understand your application domain and the world.
You can then exploit this understanding to provide additional value, offer new functionality, or achieve greater efficiency. People are getting used to apps that get better with popularity and longevity, and they will increasingly demand these data-driven apps in the future. So we'll start by looking at a few example applications that have been ahead of the curve on using data to provide a better customer experience, starting with a particularly famous and ubiquitous one. I'm not looking at the chat window right now, but I'm going to assume: do we have anyone here who has never bought anything from one of Amazon's retail websites? I'm assuming not. If so, I guess I want to know about it. I'm mentioning Amazon retail not just because it's a great example of a data-intensive and data-driven application, but because even if we just look at the user-visible experience of Amazon's site, we can see some obvious shifts in how they've used data. To see those in action, let's think about what retail looked like before Amazon. We'll have to go back to the mid-1990s. Some of you may not have been shopping in the mid-1990s, so you may need to recall that an album is a generic term for a physical copy of an audio recording. You can think of it as an expensive, cumbersome, low-capacity, and somewhat hard-to-reproduce way to carry around a playlist of songs. So if I wanted to buy an album in 1994, I'd drive across town, or maybe across the county, to go to a record store that had a good selection and knowledgeable, taste-making clerks. They knew me and they knew what I liked, because I spent a lot of time hanging out in record stores in the mid-1990s buying albums, so they could recommend recordings I'd probably enjoy, paring down their selection to things that were more likely to fit my tastes. These clerks also knew enough about music to be able to answer questions about their stock and about music at large, even from people they didn't know: "I just saw this band in concert; which of their albums is best?" "If I like this band, will I also like that band?" And so on. There's something really great about having that connection to a community of other fans and getting guidance from experts. So let's fast-forward a few years to the late 1990s. At this point, Amazon.com has been in business for a few years and has expanded beyond books to also sell compact discs and movies. If you want an album that's in print, even if it's a rare one, Amazon probably has it at a decent price and can get it to you within a week or so. Now, a lot of people thought at the time that having an efficient shopping website was the real innovation of Amazon, but even then it was pretty clear that a lot of their value was in the data they'd collected. Instead of just giving you boring, publisher-supplied metadata that amounted to little more than press releases, Amazon would give you a bunch of scored narrative reviews from other users. You could use this information, to some extent, to answer the kind of taste questions you'd have needed an experienced record store clerk for just five years before. The next step for Amazon, which they had in full swing by the middle of the last decade, was to use aggregated reviewing and purchasing data to make product recommendations. So now, instead of reading every review, you might just let Amazon show you only the reviews that other users had marked as most helpful.
If you searched for a product and decided not to buy it, you wouldn't have to look at every possible alternative. Instead, Amazon would tell you what things people who looked at the same product ultimately bought. Amazon also started proactively suggesting products they had reason to believe you'd like: after you bought something, when you visited their site, or over email. Now, everyone has an example of this backfiring, right? You buy some durable good, like a television or an infant car seat, and immediately get recommendations for other items in the same product category. You know, people who bought this vacuum also bought other vacuums. So these emails weren't perfect, but they were often very good, and I actually had to turn them off because they started to get too expensive. Amazon uses data for all sorts of other things as well: optimizing logistics, setting prices, even deciding what sorts of original content to commission for their digital media offerings. And that's just on the retail side, to say nothing of what they do with web services and so on. But even just the most obvious ways that Amazon's retail operation has used data over the years provides a pretty interesting picture of an increasingly sophisticated data-driven application. Now, Uber has been an enormously successful example of a sharing-economy company, serving as a matchmaker between short-term contractors and clients. They have a great app, and they identified a market where they could have an impact, but as we'll see, the quality of their product depends a lot on their use of data. For a little bit of background: the conventional taxi industry is heavily regulated and has a high barrier to entry. Taxis typically must be licensed by the municipality in which they operate, and there are often a fixed number of licenses per city, so the licenses may be far more expensive than the annual salary a driver might expect. As an example, at its most expensive a few years ago, a taxi medallion in New York City cost $1.3 million. A New York City medallion is around $700,000 now, but depending on whether you ask the Bureau of Labor Statistics or Wikipedia, the median taxi driver in New York City only makes between $30,000 and $50,000 annually. Furthermore, rates for licensed cabs are fixed by the municipality in which they operate, so there's no opportunity for a real market for cab services. Prices stay the same regardless of supply and demand, meaning that it may be difficult or impossible to get a taxi at busy times, from unsafe neighborhoods, or to distant destinations. This is an industry that was ripe for disruption, and Uber was there to do it. Uber provided a great app, but that was just the beginning. Anyone can become an Uber driver without spending most of a career's worth of gross salary on a taxi medallion or risking the uncertainty of leasing one. Uber efficiently matches drivers to passengers, and their app lets passengers know where the driver is and when to expect a pickup. Instead of using licensing as a rough proxy for driver quality, Uber uses a rating system to isolate bad drivers or bad passengers. They also use the information they have about how many drivers and passengers there are at any given time to create a truly efficient market for passenger transportation. If there are far more passengers than drivers, prices will go up, which incentivizes more drivers to get out on the road and ultimately ensures that more passengers are served.
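To make that supply-and-demand mechanism concrete, here's a toy sketch in Python. The function name, the ratio-based rule, and the 3x cap are all invented for illustration; this has nothing to do with Uber's actual pricing system, it just shows the idea that prices rise when demand outstrips supply.

```python
# A toy illustration of surge pricing: the multiplier rises as ride
# requests outstrip available drivers. All names and thresholds here
# are hypothetical; this is not Uber's actual algorithm.

def surge_multiplier(open_requests: int, available_drivers: int) -> float:
    """Return a price multiplier of at least 1.0, capped at 3.0."""
    if available_drivers == 0:
        return 3.0  # no supply at all: cap the multiplier
    ratio = open_requests / available_drivers
    return min(max(ratio, 1.0), 3.0)

print(surge_multiplier(10, 20))  # 1.0 -- supply comfortably exceeds demand
print(surge_multiplier(45, 15))  # 3.0 -- far more passengers than drivers
```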
Uber also uses the data they have to detect fraud, predict passenger destinations, and support a lower-cost carpool service. Interestingly, one of the taxi industry's responses to Uber's success has been to create several different apps for people to hail licensed cabs, essentially assuming that what people want is just a conventional taxi service that has a smartphone app. It's possible that some people really do want this, but it ignores that the real power of what Uber has done comes from their aggressive use of data to improve their products and services, not from their app itself. This third and final example of a data-driven app is one that's close to my heart. I'm an avid cyclist, and Strava is an activity tracker app that's also a social network for athletes. Let's think about fitness tracking before the smartphone era. A decade ago, only serious fitness enthusiasts had a dedicated sports GPS device for tracking their rides or runs; everyone else was left with a stopwatch and a notebook. After phones with GPS receivers became widespread, a lot of companies developed activity tracking applications, but most of these offered only minor differentiation from each other. Strava differentiated itself by offering a great user experience, an excellent social network so you could follow and encourage your friends and rivals, and a really addictive feature: a competitive aspect. They allowed users to designate parts of their activities as segments, for example a local hill climb, a town line sprint, or a stretch of county highway along the lake, and then see leaderboards over time of how their rides compared to the fastest people ever to ride those segments. Some of those people turn out to be professional athletes, and some are just enthusiastic amateurs. It's exciting to see how you compare to a pro, but for a lot of people who want to compete with their friends without racing, this feature has proven irresistible. Like the other example apps we've talked about, Strava leveraged their popularity to collect a lot of data, and then they used that data to make their product far better. The first data-driven feature I want to call out is routing, but let's motivate it a bit first. Think of going for a ride or a run in an unfamiliar place. You probably won't just head out, even if you have a map: you might get lost, you might not find any good roads, and you might even wind up someplace unsafe or unsuitable for cyclists or runners. The routing application you'd use to get turn-by-turn directions in a car probably weights candidate routes based on distance, speed limit, and possibly traffic or road closure data, with the goal of getting you from point A to point B as quickly as possible. Because Strava had millions of rides to analyze, they were able to make a routing application that weighted routes based on the goals of endurance athletes, like elevation profile, safety, and frequency of interruptions. After all, it's hard to get a workout in if you have to stop for a light at every block. Now, the underlying map data may not reveal which roads are safe, or which have nice paved shoulders, or which are free from interruptions or have beautiful views, but you can be pretty sure that roads like those will see athletes spending more time on them than the alternatives will. With a large enough population of athletes doing activities, road popularity data is a reasonable proxy for safety and suitability, and Strava was able to use this data to make an athlete-focused routing feature.
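Here's a toy sketch of that kind of athlete-oriented route weighting. The field names, weights, and scoring formula are all hypothetical, not Strava's algorithm; the point is just that observed popularity can stand in for safety and suitability signals that raw map data doesn't capture.

```python
# A toy route-scoring sketch for athlete-focused routing. Higher is
# better: reward popularity, penalize interruptions, and prefer routes
# close to the rider's desired amount of climbing. All of this is
# invented for illustration.

from dataclasses import dataclass

@dataclass
class CandidateRoute:
    distance_km: float
    elevation_gain_m: float  # endurance athletes may *want* climbing
    stop_density: float      # lights and stop signs per km
    athlete_traffic: float   # observed activities per week on this route

def athlete_route_score(route: CandidateRoute, target_climb_m: float) -> float:
    climb_fit = 1.0 / (1.0 + abs(route.elevation_gain_m - target_climb_m) / 100.0)
    interruption_penalty = 1.0 / (1.0 + route.stop_density)
    popularity = route.athlete_traffic ** 0.5  # diminishing returns
    return popularity * climb_fit * interruption_penalty

lakefront = CandidateRoute(40.0, 300.0, 0.2, 250.0)
downtown = CandidateRoute(40.0, 150.0, 3.0, 40.0)
print(athlete_route_score(lakefront, target_climb_m=400.0))  # much higher
print(athlete_route_score(downtown, target_climb_m=400.0))
```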
Strava also uses data in some other interesting ways to provide features to their users. For runners, they analyze a series of runs, clustering similar ones together, to provide you with a picture of your performance over time even if you aren't always doing the same route or running on the same kinds of terrain. For cyclists, they've spent a lot of time on data engineering and data cleaning to improve the noisy altitude data that some GPS units, like the ones in many phones, provide, so they're able to offer more realistic calorie burn estimates than many other fitness tracking apps. Finally, Strava has used data as a product itself, offering data both to users and to non-users. Their basic service is free for all users, but they offer a subscription service with some additional features. The subscription service offers a variety of analytical tools for activity data, and one of the very first features they introduced was essentially the kind of roll-up and drill-down functionality you'd look for in a business intelligence reporting tool: allowing paid users to see not only where they stood on overall leaderboards, but also on leaderboards consisting only of people in the same age or weight range. They've also recently been selling aggregated data to municipalities, that is, to non-users, about which routes are most popular for cyclists and runners, with the goal of informing local transportation policy. On that front, I don't know if any cities have built new bike lanes because of Strava data, but some counties have recently banned mountain bikes from multi-use singletrack trails after seeing how fast people can get going on them. So we've seen how three great apps have used data to become even better. I'm sure you can think of many other examples of great data-driven apps, and you're probably thinking of ways your apps can become more data-driven. So now we're going to talk about what your applications need to do to process and take advantage of more data, and how you'd actually build these kinds of capabilities in a contemporary environment like OpenShift. We'll start by laying out some general concerns all data-intensive applications have to deal with, and then look at some application architectures that have been designed to address them. Ultimately, we'll focus on architectures that are especially suitable for contemporary container-based application development. Let's start by taking a high-level look at the responsibilities of a typical data-driven application. Fundamentally, a data-driven application does everything that a conventional web application might do, but it also aggregates and transforms data from multiple sources, trains predictive models based on that data, and then uses those models to transform and make predictions from new and existing data alike. It saves raw and processed data to archival storage. Finally, a data-driven application needs to support a few different kinds of user interfaces: an internal-facing interface for developers and data scientists to install new models or modify how they're trained; web and mobile interfaces for end users; a reporting and alerting interface for the business and ops side; and a management interface for deploying, orchestrating, and monitoring the actual components of the application. Now, there are a few logical components necessary to support these responsibilities. As you might imagine, a lot of these components have to do with data sources, data processing, or data presentation.
And some of them have obvious counterparts even in applications that are not particularly data-driven. We'll need to handle data from multiple sources with different characteristics. Transactional and telemetry data, among other kinds of streaming data, could arrive a record at a time as events on a messaging bus. We may have structured data in a relational or NoSQL database, and structured or unstructured data in a distributed or local file system, which we can also use for archival storage. In the past, people have thought about analytics as a separate workload that gets run in a data center, but in a data-driven app, analytics and compute is really just another component of the application, so we'll need a compute layer to actually do the work of data processing and analytics. The models that a data-driven application trains are compact summaries of data that we can use to make predictions from: for example, an object we can use to say, given these purchases, what else is this user likely to want to buy? A structure that supports efficient queries of trending topics by day or location on a social network? Or a series of questions to ask about a financial transaction to determine whether or not it is likely to be fraudulent? These models are a lot smaller than the data they were trained on, and they are typically much faster to evaluate or make predictions from than they are to train. Since models place only relatively minor demands on compute and storage, but need to be accessed quickly, we can keep them in the same kind of operational store we'd use for session data in a conventional application: a fast and lean database or key-value store. The one problem that every data-driven application has to solve is collecting lots of data from various sources, manipulating and normalizing it, and federating it behind a common interface. This is often called ETL, which stands for extract, transform, and load. This is not only a common problem, but a big one: for many data science projects and data-driven apps, data wrangling, that is, the cleaning, transformation, normalization, and so on, winds up being the vast majority of the engineering effort. A data science team might spend 80 to 95% of their time on data wrangling tasks, compared to a much smaller amount of time developing model training code. We can evaluate various solutions to the problem of data federation by asking several questions, and I'm going to have us consider these as we look at application architectures. The first question is: what is the source of truth? Is there a canonical place for our data, and where is it? Just like with scientific experiments or continuous integration tests, we want our data-driven apps to have reproducible results, which means we need to be able to get at the pristine, unprocessed data in order to replay it later, especially if we want to make some changes to our data pipeline. The second question is how well our federation solution scales. Actual real-world analytic workloads typically aren't very large, because they operate on pre-processed subsets of data, but the raw data can be enormous, so we need a federation solution that can scale up by devoting more resources to it as our needs grow. The third question is an interface question. How convenient does a federation solution make it to access your data? Are there access patterns or use cases that are unnecessarily difficult? Do you need to jump through hoops to get your data out?
Do you need to use different libraries to post-process federated data for different analytic tasks? A very simple data federation solution of just printing everything as text files to magnetic tape and putting it in a shipping container is scalable, can hold a lot of data, and is even containerized, but it isn't particularly usable. Fortunately, most of the solutions we'll see provide better trade-offs between capacity and usability. We'll start our overview of application architectures by looking at two that have been around for quite a while. In a traditional data warehouse architecture, we federate data in a relational database that is optimized for fast concurrent updates. Our application processes a stream of events, transforms them, maybe applying some business rules, and then updates the database. Each of these internal components can communicate results back to the others. I'm calling out the UI here as a separate component because users who interact with our application manually also generate events, and the output they get is determined both by business logic and by the contents of the database. Because this database is optimized for fast concurrent updates, we call this part of the architecture the transaction processing part. This transaction processing database is the source of truth, since it always has current information about the state of our system. To extend our basic transaction processing application with analytic capabilities, we'll periodically dump the contents of our transactional database to a second database, one optimized for fast reads and complex queries. We'll use this database to support various kinds of analytic processing, primarily reports that project our multi-dimensional data into an easy-to-understand spreadsheet that lets people quickly drill down and roll up along any dimension: for example, looking at quarterly sales by division, product, region, or salesperson. We can also feed results from our batch analysis back to the business logic, and allow analysts to do exploratory queries against the database. We call this the analytic processing layer, and it depends on a database that's optimized, again, for fast reads and complex queries. The disadvantages of this approach are, first, that the transaction processing layer is stateful, so you can't run it in a container orchestration platform very easily. Relational databases have also historically been very difficult to scale out: typically, if you need high performance from a relational database, your approach has been to buy the most powerful machine you can possibly afford to run it on. Because of these things, transaction processing layers are typically difficult to manage, certainly more difficult than stateless containerized services. The other problem is that there's no raw data in our source of truth. If our source of truth is the transaction processing database, we've already transformed things by the time they get there, and if we decide that there's a problem with the way we've transformed them, or we want to change our pipeline at some point in the future, we don't have a way to do that without a lot of extra work. On the analytic processing side, we have precise results that lag behind the source of truth. So we may be able to get a report for all of our sales through last week or through yesterday, but we don't know what's happening right now. Again, these databases are difficult to scale out, although since the analytic database is mostly read-only, it may be easier.
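To make the roll-up and drill-down reporting described above concrete, here's a minimal sketch using an in-memory SQLite database. The schema and the figures are invented, and a real warehouse would be a read-optimized analytic database rather than SQLite, but the queries have the same shape: drilling down means adding dimensions to the GROUP BY, and rolling up means removing them.

```python
# A minimal sketch of roll-up/drill-down reporting over a toy sales table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales
                (quarter TEXT, region TEXT, product TEXT, amount REAL)""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("2016-Q1", "East", "widget", 1200.0),
    ("2016-Q1", "West", "widget", 900.0),
    ("2016-Q1", "East", "gadget", 400.0),
    ("2016-Q2", "East", "widget", 1500.0),
    ("2016-Q2", "West", "gadget", 700.0),
])

# Drilled down: quarterly sales by region and product.
for row in conn.execute("""SELECT quarter, region, product, SUM(amount)
                           FROM sales GROUP BY quarter, region, product"""):
    print(row)

# Rolled up: drop dimensions from GROUP BY to get quarterly totals.
for row in conn.execute("""SELECT quarter, SUM(amount)
                           FROM sales GROUP BY quarter"""):
    print(row)
```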
A really important consideration for analytic processing is that you're limited to the capabilities and interface of a relational database. Things that you can express as SQL queries, extended SQL queries, or a procedural extension language for the database are easy to do, but things that might require iterating over records several times, or looking at, for example, graph reachability, may be very difficult to do in a database. Certainly, training predictive models is not something that's obviously easy to do with a traditional relational database. The second approach to look at is the data lake approach, which was popularized by the Apache Hadoop project. This federates data in a distributed, scale-out file system, so your interface for accessing data is the file. If you need more space, you can add more nodes to increase your storage capacity, but the assumption is that once you've added storage nodes, they'll stick around essentially forever. As new events arrive, they're appended to files in the distributed file system. Now, the raw data we collect from events isn't likely to be in the format we want, so we'll write some components to perform ETL on this data and analyze it in parallel. The Hadoop runtime will schedule these by migrating compute jobs that operate on parts of our data to the machines that contain those parts, so we get scale-out compute atop our scale-out storage. Unlike classic data warehouses, Hadoop enables scale-out storage and processing, and it also provides a fairly easy way to keep raw, unprocessed data in the source of truth, that is, the distributed file system, along with multiple versions of post-processed and intermediate data. However, it's still not a great fit for a contemporary container-based architecture. Because Hadoop depends on having compute jobs co-located with the storage they're operating on, you won't be able to scale your compute and storage independently. Because Hadoop expects that your storage nodes are long-lived, you essentially have to dedicate resources to both storage and compute. In practice, this means that a lot of Hadoop shops are stuck doing everything on one cluster big enough to support all of their enterprise's storage needs, and managing that cluster for a wide range of compute users and applications. There's nothing wrong with that, but it prevents us from enjoying all of the advantages of container-based architectures. Finally, there are a couple of technical reasons why Hadoop might not be the best choice for new projects. The main programming model it exposes is pretty low-level, so most Hadoop applications rely on third-party libraries, all of which have different interfaces and different expectations, which means that users have to spend even more time on ETL tasks to ensure that they can glue those libraries together. And because of how Hadoop orchestrates computations, it isn't really suitable for doing live analysis of streaming data or interactive analysis of batch data. So we've just reviewed a couple of legacy architectures. These are probably familiar to you, and they solve a lot of problems, but they really assume a monolithic application architecture and an inelastic view of our compute resources and our application's requirements. They certainly aren't a great fit for running as microservices or in containers.
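Before moving on, it's worth seeing just how low-level Hadoop's native programming model is. Here's the classic word count written for Hadoop Streaming, which runs mappers and reducers as separate processes that read stdin and write tab-separated key-value lines to stdout. This is a generic textbook sketch, not code from any particular project; even this trivial aggregation forces you to think in terms of sorted key-value plumbing.

```python
# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
```

```python
# reducer.py -- Hadoop Streaming sorts mapper output by key before this
# runs, so all the counts for a given word arrive contiguously
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))
```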
Now we'll look at some architectures for data processing that work really well with microservice-based applications, along with some example applications that use them. The Lambda architecture is an interesting approach to modernizing the traditional data warehouse. The idea is that you have a stream of events arriving, and you multiplex it to a storage layer and to a stream processing layer, which performs imprecise but low-latency analyses on the data as it arrives, giving you a view of your very latest data. This stream processing component is called the speed layer. Meanwhile, you have some batch jobs repeatedly operating on the aggregated data, performing precise analyses at a higher latency. This batch processing component is called the batch layer. Finally, the most recent results from the speed layer and the available precise results from the batch layer are federated in a serving layer, which presents users with a view of their data that strikes a compromise between latency and precision. There are some major advantages to the Lambda architecture over the two legacy architectures we looked at earlier: it's suitable for microservice architectures, and, unlike in the traditional data warehouse, raw data remains available. However, the Lambda architecture requires that you implement your analyses twice, in two different ways: once as streaming analyses in the speed layer and once as batch analyses in the batch layer. These are probably going to be written against different APIs, maybe even with different algorithms, just because of how you have to implement streaming jobs versus batch jobs. So that's a real development cost to consider. Still, this is a suitable architecture for a lot of problems. As one example, let's look at an application that does infrastructure log processing. If we have infrastructure log events arriving in a stream, we can multiplex them to a stream processing component and to a document database for storage. We can do some immediate analysis, like grepping for particular strings that we're interested in in the log messages, and we can also do some more complicated analysis in the batch layer, like training a predictive model. We can combine these to present a unified user interface that has alerts based on things we've automatically detected from the streaming data, and we can incorporate an anomaly model trained in the batch layer, communicating it back to the streaming layer to provide more precise alerts later. The Kappa architecture was named in response to the Lambda architecture, and its design reflects that our message queue systems, our stream processing frameworks, and our understanding of how to design streaming analysis algorithms (that is, algorithms that only look at each record once and update a model in place) have dramatically improved since the introduction of the Lambda architecture. The basic idea is that as you receive events, you put them on a message bus and give them a particular topic. So the component that would be doing ETL in another architecture is simply listening for messages on the bus with a raw data topic and then writing messages with a pre-processed data topic as output. Similarly, a job that does analysis reads pre-processed data and writes analysis results, and so on. Our user interface components are really just clients of this analysis-results topic.
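Here's a minimal sketch of that topic-to-topic pattern using the kafka-python client: one stateless component that consumes the raw topic and produces a pre-processed topic. The topic names, broker address, and the normalize step are assumptions made up for this example; a production Kappa deployment would sit on partitioned, durable, replayable topics and would likely use a full stream processing framework.

```python
# A minimal Kappa-style component: read raw events from one topic,
# write normalized events to another. Topic names and the broker
# address are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def normalize(event):
    # Hypothetical ETL step: lower-case keys, drop empty fields.
    return {k.lower(): v for k, v in event.items() if v is not None}

# This component just listens on the raw topic and writes the
# pre-processed topic; downstream analyses are clients of that topic.
for message in consumer:
    producer.send("preprocessed-events", normalize(message.value))
```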
This model assumes that our queues are durable and infinitely replayable, that each topic can be partitioned to provide scale-out, and that our problem can actually be solved by a streaming algorithm. Those aren't crazy assumptions, but they are things to keep in mind, because they're not always going to be the case. With that said, the Kappa architecture has a lot of advantages. It is suitable for microservice architectures, and since the stream is the source of truth, if you can always replay the stream, you can always update some component of your analysis and get new results. However, it requires a sophisticated streaming framework, potentially a lot of infrastructure work, and streaming algorithms, which can be trickier to implement than batch algorithms. A really cool example of the Kappa architecture at work is the infrastructure for the Fedora project. Fedora is a Linux distribution, and there are a lot of moving parts in maintaining a Linux distribution that need to communicate. Coupling them all tightly together wasn't the right solution for Fedora; it was unsustainable. So the mechanism that ties all of these components together is a message bus. When someone updates a package in a Fedora Git repository, maybe moving to a new version of the upstream source or adding a patch, a message gets sent out on the bus. When someone edits a wiki page or starts an IRC meeting, a message gets sent out on the bus. When a package build starts, succeeds, or fails, a message gets sent out on the bus, and so on. The Kappa architecture works really well here for coordinating components that logically depend on one another, and for supporting a wide range of interesting views of the data generated by one of the world's largest open source communities. Another flexible architecture for data processing involves moving the data federation problem from the storage layer to the compute layer. Instead of federating unstructured data by storing it in a file system and defining ad hoc ETL pipelines, or explicitly structuring it upon import into a database or message queue, you use a compute layer flexible enough to give you a uniform abstraction over multiple data sources. Two systems that do this are Apache Spark, which unifies various data sources under its RDD and DataFrame interfaces, and JBoss Data Virtualization, which provides a virtual relational database that executes query plans across multiple different data sources. Again, this architecture is suitable for microservice architectures. It's typically flexible and interoperable with other systems: instead of worrying about how to convert data explicitly from what one system expects to what another expects, you have a bunch of connectors to get data into your compute layer, and then you can access it under a uniform interface. Finally, a really nice advantage of this approach is that you can use it to support both batch and streaming workloads, depending on the compute layer you've chosen. As an example, let's look at an application that models infrastructure costs for applications running in a public cloud. Here we have a couple of different data sources that we're going to federate and use to train a predictive model: billing data from a service provider that lands in an object store, and service telemetry from a monitoring tool that's stored in a legacy relational database. We can incorporate both of these data sources into Apache Spark and use them to train a model that says, based on how we're using the system, how much it's likely to cost.
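Here's a sketch of what that federation might look like in PySpark. The object store path, JDBC coordinates, and column names are all hypothetical; the point is that once both sources are loaded, joins, transformations, and model training all happen through one uniform DataFrame interface.

```python
# A sketch of federating two data sources in the compute layer with
# Apache Spark, then training a cost model. Paths, JDBC details, and
# column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("cost-model").getOrCreate()

# Billing data from a service provider, landed in an object store.
billing = spark.read.json("s3a://billing-bucket/monthly/")

# Service telemetry from a monitoring tool in a legacy relational database.
telemetry = spark.read.jdbc(
    url="jdbc:postgresql://db:5432/monitoring",
    table="telemetry_monthly",
    properties={"user": "analyst", "password": "..."})

# Federate in the compute layer: one join, one uniform DataFrame API.
training = billing.join(telemetry, on="month")

features = VectorAssembler(
    inputCols=["cpu_hours", "gb_stored", "requests_millions"],
    outputCol="features").transform(training)

# Predict monthly cost from usage: similar usage, similar cost.
model = LinearRegression(featuresCol="features",
                         labelCol="cost_usd").fit(features)
```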
So based on operational characteristics and telemetry data, we can have a prediction of how much the system is going to cost to run in a given month, based on how much it cost when it had similar usage in previous months. So we've seen a few ways to design a data-driven application that we can deploy as microservices in containers. We know which architectures will and won't work in contemporary environments, and we have a pretty good idea of what kinds of applications these architectures are going to be useful for. I want to close out this talk by emphasizing that a lot of the components you're already using to build applications, whether monolithic or containerized, are easy to use in this new context. And this is true both for interesting open source community projects and for Red Hat products as well. So let's go back to our application responsibility diagram, and we'll start at the bottom. Obviously, we're going to be deploying, orchestrating, and managing our applications on top of OpenShift. For processing streaming events, you have a lot of options, including JBoss AMQ and Kafka, that can run in containers. JBoss Data Virtualization provides you with a unified interface to legacy databases, and reliable distributed file systems like Gluster and Ceph are a great option for file and object storage, both as a source for input data and as a destination for archived data. My team has had a lot of success using Apache Spark as a compute layer for internal applications over the past couple of years, and the model of federating data in the compute layer is a really great fit for microservice-based applications. For an operational store providing a fast cache of models and session data, there are a lot of great key-value store and document database options, including Redis, MongoDB, and Infinispan (or JBoss Data Grid). You have enormous flexibility to choose components to make up the user-facing parts of your program, as you would with any containerized application. There are options for mobile application backends that work really well with OpenShift, and if you need to incorporate business rules into the reports your apps generate, JBoss BRMS, the Drools project, is also a great option. Now, I'm not mentioning all of these so that you'll go out and buy everything Red Hat sells, although I wouldn't discourage you from doing that either. I'm just pointing out that a lot of the products and technologies you're probably already using in production have a place in the data-driven apps you want to be developing today and tomorrow and deploying on OpenShift. For next steps, I have a link here that I can paste into the chat later to see a demo of a developer workflow for a data-driven app on OpenShift. This is my colleague Mike McCune showing what it looks like to take a log processing application that uses Apache Spark on OpenShift and update it in response to the actual code changing. Very slick demo. I also have a container you can run on your local machine to try out Apache Spark with a notebook interface. It looks like that link is missing from the slide, but I will put those instructions in the chat as well. And then I would say the next step from there is really to build a data-driven app on OpenShift. So at this time, I can take questions. Thanks for your time, and here's how to get in touch with me if you need to do so.

Awesome, Will, thank you very much for that.
That was probably the best overview I've seen of all the different approaches and models for using data-driven or big data applications on any platform, so much appreciated. We don't have any questions right now in the chat; I think, because of the nature of this, there's a lot to digest. What I would suggest is that we're going to have part two of this session coming up. The next session we do is going to be on working with a big data application in OpenShift and Spark, basically building on the foundation Will set out here, sort of an almost hands-on demonstration of developing a big data app on OpenShift. It's going to be very similar to Mike McCune's Vimeo demo; we'll probably walk through that, answer questions, and deep dive on it. That should come up in, I think, two or three weeks, depending on people's vacation schedules. And anyone who's interested can join the mailing list for future big data events and talks such as this series. I think we've got a series of four different talks set up already, and a number of other Commons members interested in presenting their use cases as well, so there's quite a lot to talk about in this space. Let me see if we've got a question there. Yep, someone is, yes. We'll definitely take you up on your offer to do that demo live and find a time for that as well. If there aren't any other questions, what I'd also ask, Will, is if you can send me your slide deck or a link to it, because there was some excellent information in there, and we'll repost this recording and all of your links on blog.openshift.com, probably in two days' time; that's about how long it takes to edit the videos. We'll definitely have you back to do a lot more talks on big data, because this has been pretty much a hot topic at all of the conferences I've been going to lately. There's a lot of big data analysis that we do under the hood, and I saw a wonderful presentation Diane Federer did at ApacheCon on analyzing OpenShift Online's log files, so there's great content out there. If there are other folks who have things they want to talk about around big data, please do join the big data SIG, and I'll make one more pitch for signing up: it's commons.openshift.org, slash, SIG, OpenShift big data. You can find that easily, sign up, and you'll get on that mailing list and find the dates for the next four parts of this series. So thanks again, Will, and thank you everybody for joining us today. And since it's 2 a.m., or now 3 a.m., where a few of you are, I hope you can catch some sleep. Take care.

Thank you, Diane.