Hello everyone. Thank you for coming to Metcalf Small. My name is Nandan. I'm here to introduce our speakers, Alex Corvin and Landon LaSmith. They're going to be talking about scaling your Open Data Hub for fun and production. Please welcome them.

Thank you. Hi, so I'm Landon LaSmith. I'm on the team that's primarily responsible for working on the Open Data Hub. I'm passionate about Linux and all things containers. And unlike Alex, I am not a beekeeper, but I like honey. Alex is much more interesting, and I'll let him introduce himself.

Cool. Yeah, so I'm Alex Corvin. Like Landon, I'm a software engineer here at Red Hat. I work on the Open Data Hub, but I'm primarily focused on running Red Hat's internal instance of it. I'm kind of a systems engineering, DevOps, site reliability engineering kind of guy. That's what gets me excited: scaling systems, that sort of thing. And yeah, unlike Landon, as he said, I am a beekeeper, so if you want to talk about bees after this, find me. Do I like honey? I don't really like honey, it's weird. I like having projects, and I get bored if I don't have something to keep me busy. I saw a post on Reddit one day about beekeeping and thought it seemed interesting, so now I'm a beekeeper. But yeah, that's who I am.

So what you can expect from this talk: we'll introduce the Open Data Hub, we'll talk about fun versus production with a heavy emphasis on production, which is Alex's expertise, then challenges we faced while scaling our internal instance of the data hub, and finally we'll answer any questions.

So what is the Open Data Hub? First and foremost, it's a reference architecture for running data and machine learning projects on OpenShift. We want to develop a community where we can give you best practices for how to deploy different tools in OpenShift environments. Our primary way of doing that is the development of a meta operator: an operator that handles the deployment of other operators and custom resources in OpenShift. We're not trying to control the whole stack; we just want to make it easier for DevOps engineers, data scientists, and data engineers to deploy an environment on OpenShift.

At a high level, these are the core parts we're focusing on. For DevOps, we have support for monitoring, optimization, and model serving. For the data engineer, we primarily focus on data storage using Ceph object storage, so you can use your favorite S3 tooling, like the AWS CLI, to push and access that data. And right now the Open Data Hub is focused on making it easier for data scientists to perform their workflow on OpenShift, so it's primarily focused on the yellow and green parts of this diagram for now. Alex is working on the internal data hub, whose core focus has been on the blue and green, with some intersection for the data scientists.

This is a very busy diagram of the low-level focus of the Open Data Hub. At the top you'll see we have different tools for AI and machine learning. Right now we're focusing on JupyterHub to let data scientists interact with different tools, and we have the AI Library for ML applications.
The AI Library is a collection of different models that you can import or use to model your data. Seldon is currently in the Open Data Hub so you can serve models, and a few other things. Alex right now is working on the data analysis portion, metadata management, and I think storage as well. As the Open Data Hub grows, we're working to bring a lot of those internal data hub features that we've been using in production into the Open Data Hub.

So, the Open Data Hub is fun. This is as simple as it gets: we want to make it easy for you to deploy different tools into your environment. It's modular, so if you want JupyterHub and Spark but you don't care about monitoring, you can do that. I think we have an instance running internally where we only deploy the Spark operator, without JupyterHub, but it's completely up to the user at deploy time. We want to simplify the deployment and redeployment of your tools, so if you want to play around with different tools or data sets, you can. If you screw something up, we'll let you wipe it out and redeploy it, and make that really simple and easy. So that's the fun part: we're focused on letting you easily deploy everything you need, and if you screw it up, just burn it down and start it back up.

How do we plan for production? This will be a quick overview; Alex has all the good information about how they actually do it in production. With the Open Data Hub, we enforce modularity: if one product breaks, we don't want it to bring down the whole stack. That's one of our core tenets for adding new components to the Open Data Hub. A component is the individual piece that we're deploying; that could be the Ceph object store, Kafka, Prometheus, Grafana, JupyterHub, the Spark operator, or anything we add in the future. Reproducible behavior: we want to make sure that if this grows into a production system, you can redeploy it with the minimum amount of effort and have everything work. And the key thing we're focusing on to make it ready for production is gathering as many use cases as possible, because when we know what breaks, we can make the Open Data Hub better. That's probably the most important thing we're asking for. The Open Data Hub is a community: we want as many people as possible to use it and to contribute new components so we can figure out the best ways to improve it.

All right, so thank you, Landon, for giving the intro to the Open Data Hub. I'm going to quickly give you an intro to what my team and I do on the internal data hub. Can you all hear me okay? I feel like I'm not doing a very good job of speaking into this, so let me know if I start breaking up.

First and foremost, the internal data hub, and by internal I mean internal at Red Hat, where we just call it the data hub, is a platform and environment for enabling teams at Red Hat to become, as we call it, data centric. We want teams to have an environment where they can store their data, where they can reliably get access to that data, and where they can explore and implement actual production workloads with various data science tools. We maintain this environment and make sure it's stable, reliable, and meets their needs.
And then a really big part of what we do is outreach and enablement for these teams: making sure they know how to use our systems and the components of the Open Data Hub, so JupyterHub, Spark, what have you, and guiding them through walkthroughs, tutorials, and guides tailored to actual meaningful use cases for teams at Red Hat. So that's one really big part of our charter.

The other thing we do is act as a kind of proving ground for the Open Data Hub itself. The model we've been working with is that if there's a new component we want to add to the Open Data Hub reference architecture, we usually run it internally first, make sure we understand how to run it reliably, and work out all the kinks before recommending to the world that they run it in their environments. So we're kind of the guinea pig, if you will, for some of these components, and I'll get into some of the things we're working on right now in the next slide. Related to being a proving ground, but an important distinct dimension, is that we're also customer zero: the initial point where all these components get to run at scale. We're not at crazy scale right now, but we're at a solid scale; on the order of 100 gigs of data run through our system a day. It gives us an environment in which to really prove out that these components can work in a production environment.

So, as I mentioned, new components to the Open Data Hub typically go through us first, and I wanted to touch on a few things we're working on right now. The first is Kafka. Malakir, back in the room, has been owning that, and it will actually be a big part of my talk later on. Prior to Kafka in the Open Data Hub, our processes did not really scale. We hadn't really built a very effective streaming data platform; it was really hard to scale our processes, our architecture, and how teams sent data to us and got access to it. Kafka has been really helpful there, and it's a very strong recommendation we'll make to users of the Open Data Hub when they build out their data lake platforms: leverage Kafka for that. And again, I'll get more into that. Anyway, Kafka, I think it's in master now; you can deploy Kafka with the Open Data Hub operator, and I think that will be officially released and out on Quay for you to pull in a couple of weeks, at the end of the month.

The next thing we're working on we lump together under what we're calling the data catalog. Specifically, there are three components: one is Hue, one is the Hive Metastore, and one is the Spark Thrift Server. Together, these do a couple of key things. One is they give you a pane, or a window, through which to query, explore, and move your data using familiar tools, specifically SQL. So your data scientists and data analysts who are familiar with executing SQL queries can now do so on the data stored in your S3 data lake using traditional SQL tools.
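To make that concrete, here is a minimal sketch (not from the talk) of what querying an S3-backed table through a Spark Thrift Server might look like from Python, using PyHive, which speaks the HiveServer2/Thrift protocol the Thrift Server exposes. The host, port, username, and table name are placeholder assumptions:

```python
# A hedged sketch: plain SQL over data that physically lives in a
# Ceph/S3 data lake, routed through the Spark Thrift Server.
# Host, port, and table name are placeholders, not values from the talk.
from pyhive import hive

conn = hive.Connection(host="thrift-server.example.com", port=10000,
                       username="data-analyst")
cursor = conn.cursor()

# Familiar SQL tooling; the analyst never touches S3 APIs directly.
cursor.execute("""
    SELECT owner, COUNT(*) AS events
    FROM sales_events
    WHERE event_date >= '2019-01-01'
    GROUP BY owner
""")
for row in cursor.fetchall():
    print(row)
```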
The other thing it lets you do is build out rich sets of metadata about all of the different data sets you have: store sample data, schema data, and whatever arbitrary catalog information you want, like who owns the data, what the SLA for it is, what the volume of data is, et cetera. You store that in a database alongside your data and then get really easy access to it through Hue. So the Thrift Server provides the ability to execute SQL, the Hive Metastore provides the ability to store metadata, and Hue provides the ability to explore all of that. These are three components we've been leveraging for a little while now to build out extract, transform, load (ETL) pipelines internally here at Red Hat, to get our data into a place where the data scientists can work on it. We've been working out the kinks, but it's working pretty well internally now, and that's another thing we're hoping to have publicly available for use at the end of the month.

Finally, the third thing we've been working on internally for a while is Elasticsearch. I'll talk more about Elasticsearch later, but we have a lot of teams internally at Red Hat who run their systems in production, typically on OpenShift, and these systems generate a lot of log data. If a production issue happens, you want to be able to get at those logs to see what happened. The problem is that OpenShift pods are ephemeral, and if the pod logs are only stored with the pod, it can become really difficult to get access to them because the pod isn't there anymore. So we implemented Elasticsearch. This has been around for a little while, but we've recently wrangled the beast, if you will, and are still working on that. It provides a really solid platform, Elasticsearch with Kibana to visualize, for groups to send us their operational logs, view them in pretty much real time, see exactly what's going on, and find out whether an event is happening across multiple pods; there's a lot of powerful stuff in it. Elasticsearch, I don't think, is officially on the roadmap right now for adding to the Open Data Hub operator, but it's something we'd like to do in the future. So that's something we work on a lot internally: making sure it's ready for prime time and can be easily installed. As Landon said, we want components of the Open Data Hub to be really simple to get going with, so we work on that a lot, and we want to get Elasticsearch to that kind of place before we officially make it part of the Open Data Hub. Anyway, the full roadmap for the Open Data Hub operator is out there if you want to check out exactly what we're doing and when we plan to have it done.

So enough of that; now we're getting to the exciting stuff. I want to talk about some challenges we've faced internally while scaling the Open Data Hub. My hope is that this won't be too specific to the Open Data Hub, so that even if you don't plan to run it, you'll still get some value. Some of it might be specific to running a large big data platform or a data lake, but hopefully you'll be able to take some generic nuggets from this to use in your own production systems. There are four specific areas, four specific challenges, that I want to talk about, along with how we're tackling them. The first one is monitoring.
If you came to my last talk, you know this is something I feel passionate about: you should not feel comfortable calling a production service production ready if it's not monitored, if you don't have alerts in place, and if your team doesn't understand what to do with those alerts. This is something we've spent a lot of time on, and one of the cool things about the Open Data Hub operator is that you get Prometheus and Grafana monitoring for free. My team's job is to make sure the internal data hub is stable, reliable, and available, and a big part of how we do that is monitoring. We typically do that with Prometheus, plus Prometheus's Alertmanager component, which lets you pull metrics, generate alerts based on those metrics, and route those alerts however you want to receive them: PagerDuty, email, Slack, whatever.

The first thing I'll say is that if you're running an application on Kubernetes or OpenShift, I highly recommend you play with Prometheus. It integrates natively with Kubernetes and makes it really easy to pull metrics from your systems. It's really easy to add custom metrics to your application if you have a custom Python app or API. And if you run something more standard, maybe a database server or an Apache web server or something common, there is a multitude of public open source libraries, called exporters in Prometheus parlance, that will give you a lot of really rich metrics from your application without you having to become a super expert on that application just to know how to monitor it. So that's the first thing: leverage Prometheus to monitor your applications.

The other thing I want to say, which has taken us time to learn, is to make sure the team actually understands what the metrics and the alerts mean. In the olden days, we would deploy a service to production with maybe not a lot of monitoring, and not necessarily take the time to really understand the application and know what actually needs monitoring. This has been a learning experience for us with Spark and JupyterHub recently. We've had Spark and JupyterHub deployed internally for a little while, but it was kind of a wild west; I almost hesitate to call it a production service. We didn't really know what to monitor about it, how to monitor it effectively, what was worth alerting on, or what to do when alerts fired. So that's the second big piece of advice: if you're going to run something in production, take the time to understand how to run it. If it's a database, you can leverage, for example, the MySQL exporter for Prometheus so you don't have to do all the monitoring yourself, but take the time to research what the metrics mean, what you need to care about, and what's just white noise. That's what I have to say about monitoring, and I'm certainly passionate about it, so if anybody wants to come up and talk to me more about it afterwards, feel free.
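As an illustration of how little code custom metrics take, here is a hedged sketch using the prometheus_client Python library; the metric names, port, and workload are invented for the example, not taken from the data hub:

```python
# A minimal sketch of instrumenting a custom Python service for
# Prometheus. Metric and service names here are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("myapp_requests_total",
                   "Total requests handled", ["status"])
LATENCY = Histogram("myapp_request_seconds",
                    "Request processing time in seconds")

@LATENCY.time()  # records how long each call takes
def handle_request():
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    # Prometheus scrapes http://<pod>:8000/metrics
    start_http_server(8000)
    while True:
        handle_request()
```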
So the second area, and here hopefully I'll have some more nuggets for you, is something my team has worked really hard on: scaling our processes. When we first deployed the Open Data Hub internally, the data hub, we were in user acquisition mode, right? The important thing was to get it running, get as many useful services running in it, and then work really hard to get teams at Red Hat using it, and we were successful with that. This might be a familiar story for a lot of teams: you have success, you get a lot of people using your system, and then more and more people start coming to you trying to use it. The problem is that until now you've been so focused on doing whatever it takes to get people using your system that it turns out you don't have good, repeatable, scalable processes for onboarding new users.

For us, we have a couple of different storage platforms: one is our Ceph S3 data lake, one is Elasticsearch. They have different authentication methods, different patterns for writing and reading data, and different processes for adding a new user; they're just very different, and that made onboarding new users really difficult. In Elasticsearch's case, we had to add new certificates to the Elasticsearch system with read and write permissions to whatever sets of indices, and then do a production rollout of Elasticsearch, which took literally two days, just to add the new user. For S3 there was this whole other process where you've got to issue credentials, assign users to buckets, and keep track of all that. It was very hard, and ultimately what happened was that a request for a new onboarding would come in, we'd see it, and we'd say: oh man, that process is a pain, I don't feel like doing it, I'm just going to let that sit for a little while, maybe somebody else on the team will do it. Then two weeks later that request has been sitting there forever and we'd better take care of it. Team morale suffers; nobody wants to be doing this stuff, it's not fun, you're spending all your time doing onboarding, it takes a long time, so now your team looks bad, and the important maintenance tasks don't get addressed.

An example there, and I'll get into the details in a second: at some point in the history of our system we started implementing controls for how long data would be kept around. The problem was, since we didn't have our processes well defined, those lifecycle policies wouldn't get applied, so the system would grow unfettered and get into worse and worse health over time, because we didn't have processes to adhere to. Similarly, our scaling considerations wouldn't get properly applied: if there was a unique data set or a particularly large data set, we'd have to follow a particular process to onboard it, and again, without well-defined processes that wouldn't happen. Without well-defined processes we wouldn't know what data was stored in our system, or, importantly, whether there was personally identifiable information in it, whether there were sensitive data sets. We had all these things where, because we didn't have well-defined processes for onboarding new systems, a lot of stuff went by the wayside over time, and it just became harder and harder to manage.

The advice here is that at some point you have to figure that out. At some point you have to sit back and do the work, and maybe it's painful, but you have to do the work to figure out what the processes are. Get together as a team, write down everything you have to do, make sure everybody understands the process, make sure it's documented, make sure everybody on the team can do the process, and be really, really explicit about what the steps are.
Maybe you realize that, wow, this process is really bad, it's really hard, it does not work. But now you have it documented, and then you can start to improve it, and then maybe you can start to automate it. That's the phase we're at internally right now. It's been a little hard to get here, but I think we're in a good place now: instead of new user onboarding taking weeks, it takes maybe a couple of hours, and that's been really good. As a result, we'll be able to go out and tell people: this is how you should do it. That's our goal. Scaling our processes has been a big challenge.

The next big challenge we've been working to overcome is streamlining our architecture. I'm going to talk through this diagram a little, but let me first say that if you're running a production data lake, as it grows and as you have more and more data sets going into it, in our experience the logic for where that data goes and what to do with it can become more and more complex. In our system we can store data in Elasticsearch or in our Ceph S3 data lake. Traditionally, Elasticsearch has been used for things like log streaming and log analysis, and Ceph S3 for the data science and data analysis kinds of tools we use. A lot of those tools are really good at working with AWS, and Ceph provides an S3 API those tools can work with. If a team wants to do data science, we push them towards S3; if they want to do log streaming, we push them towards Elasticsearch. But there are cases where a team wants to do both. Long term we'd like to do something better, but right now we just write to both, so there's duplication; that's okay. Sometimes there are transformations that have to happen on this data before it reaches its final resting place, so there are different layers of processes that handle that. And again, going back to before we had this streamlined architecture, it seemed like every time we added a new data source, we just added a new component to the data hub to do that new thing, and it became very unwieldy, with a lot of duplicated work, and just hard to maintain. That's kind of a theme here: it became hard to maintain.

So what we did is we introduced Kafka: Kafka, Strimzi, Kafka Connect. I think there have been other talks here, maybe specifically on the Strimzi operator. If you want to run Kafka in Kubernetes or OpenShift and you're not doing it with Strimzi, use Strimzi; it's the way to do it. Strimzi is an operator that runs Kafka in OpenShift or Kubernetes. Anyway, we run Kafka in front of the data hub, and all new writes of data go through Kafka, and that lets us do whatever we want. If the logic for where the data goes is pretty simple (it goes to Elasticsearch, it goes to S3), we have very generic ingestion layers that handle it: the data producers basically just write to the right place in Kafka with the right key, and our system handles where it goes. If it's more complex, we have a very consistent layer of what we call normalizers to do those necessary transformations; they just read from Kafka and write right back, and the data ends up in Elasticsearch or S3. The benefit of this is that we don't have to worry about all the different complexities of where data ends up; we can do things in a very generic way that will scale, and then we just have to worry about the stuff from Kafka back.
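Here is a rough sketch of that normalizer pattern using the kafka-python client. The topic names, bootstrap server, and the transformation itself are placeholders; the point is the shape: consume from a front-line ingestion topic, transform, and produce to a final topic that a generic sink drains into Elasticsearch or S3:

```python
# A hedged sketch of a "normalizer": read from an ingestion topic,
# apply whatever transformation the data set needs, write to the
# final topic. All names here are illustrative, not from the talk.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ingest.myteam.logs",              # front-line ingestion topic
    bootstrap_servers="kafka:9092",
    group_id="myteam-normalizer",      # consumer group scales horizontally
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Example transformation: rename a field to a common schema.
    record["@timestamp"] = record.pop("ts", None)
    # The final topic is the last stop before a generic sink dumps
    # the record blindly into Elasticsearch or S3.
    producer.send("final.myteam.logs", record)
```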
For all of our data producers, if they want to start sending us data, we can create a topic for them in Kafka and tell them where to write; the writing is on them. The other nice thing is that Kafka is very easy to scale: you can add topics, you can add topic partitions, you can add consumers. That's maybe not the case with some of the backend things like Elasticsearch or S3, so if we have a burst of data we can scale up Kafka, or messages can queue up in Kafka, and that gives us a lot of resiliency that we may not have on the backend. So the recommendation here: I'm a big fan of Kafka. If you're not familiar with it, Kafka is an enterprise message broker and data streaming system. If you have large amounts of data, complex logic for where data should go, or complex logic for transforming data, I highly recommend you use Kafka. And again, back to the Open Data Hub: we've spent a lot of time understanding how to run Kafka and how to architect it as part of a data streaming platform, and we're really excited to be adding it to the Open Data Hub operator and providing it to the world to run as part of their data lakes on OpenShift.

Okay, so the next thing I want to talk about is managing the volume of data. How am I doing on time? What do I have, 15 more minutes? Cool. For this one I put "if you build it, they will come" on the slide; you know, from the baseball movie. If you haven't seen it, watch it; it's a good movie, probably better than the rest of this talk. For us this has been really true. I mentioned that for a long time we worked really hard to make the data hub a stable platform people could write their data to, and I think we were successful with that, and now we're struggling with our own success. We're operating at scale now, and that brings all sorts of problems, or challenges rather, I should say. If you build a shared data platform, people are going to want to send their data to it, and one of the problems we've had is that people have a way of sending you data and then kind of forgetting about it, or you don't really know what data you have, or that data is only useful for a period of time yet you keep it around forever.

Three specific things I'll say here. One is to introduce controls for limiting how long your data is kept: introduce data retention policies. This is really easy to do with Elasticsearch; there's a tool called Curator you can use to expire old indices. It's really easy to do with Ceph S3; you can implement data lifecycle policies. Utilize both. And honestly, for us, we're not quite there yet; it's not a strict default, but we recommend having default policies for how long data will be kept in your system, adhering to them, and requiring a good reason not to. We like to default to keeping data around for 30 days. We can be flexible on that, but it forces people to think about whether they really need that data around, and honestly, sometimes there are use cases with regulatory requirements that you not keep data around forever. Requiring people to think about how long they store data is better for everybody: you spend less on storage, and it's easier to scale because you don't have to scale as much. So strict data retention policies, that's the first thing.
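On the S3 side, a retention policy like that 30-day default can be set through the S3-compatible API that Ceph exposes. A minimal boto3 sketch, with a placeholder endpoint, credentials, and bucket name (on the Elasticsearch side, Curator achieves the equivalent by deleting old indices):

```python
# A sketch of a 30-day retention policy on an S3 bucket, set via the
# S3-compatible API with boto3. Endpoint, credentials, and bucket
# name are placeholders, not values from the talk.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-s3.example.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.put_bucket_lifecycle_configuration(
    Bucket="myteam-logs",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-after-30-days",
            "Filter": {"Prefix": ""},   # apply to every object
            "Status": "Enabled",
            "Expiration": {"Days": 30},  # the 30-day default described above
        }]
    },
)
```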
The second thing that has been really important for us in managing the volume of data is implementing a way to keep track of the data that's in your system. At some point, if you have a bunch of people sending data to you, it's impossible without a system to know what data you have. And the problem is, if you don't know what data you have, then why are you storing it? You want your data scientists and data analysts to be able to explore what data is available and do interesting things with it. You don't want 50 teams at your company all storing the same set of data. There are so many benefits to knowing what data is in your system, and when you're working with hundreds of gigs or more of data a day, you can't keep track of that manually. This is where the Hue, Hive Metastore, and Spark Thrift Server solution I mentioned earlier comes in: we think it's going to be a really scalable mechanism for keeping track of all the data and allowing people to easily explore it. So if you're running a data lake, have a tool like that.

The last thing here is keeping track of who owns data. We can lump all of this together under data governance: who owns the data, who has access to it, is there any sensitive information stored in it, who is the data about? Is it about a system, a customer, a random person? It's probably just one bullet point up there, but it's very related to keeping track of the data, and it's important to introduce these data governance policies. Again, we think the data catalog solution we're working on will really help with this, but honestly, this is an area where we're still not fully mature internally, and it's something we're working on. If you're working on a data lake and you have experience with this, we'd love to talk to you about it, but it's certainly an important area to consider. With things like GDPR, and CCPA, the California Consumer Privacy Act that's similar to GDPR and is coming out, it's becoming very important to know what data you have. Everybody is concerned about privacy. You have to know what data is in your system, who owns it, whether it's subject to regulatory requirements, and whether there are processes defined for who can get access to it, how you delete it, and how you update it. When your system starts to scale, if you don't have processes in place for knowing and keeping track of all this, you're not going to be successful.
So finally, the last challenge area I want to talk about is organizing data efficiently. This is going to get into the nitty-gritty of Elasticsearch and Ceph S3 for a second. Again, when we initially deployed the data hub and were scaling up, we just kind of wrote data wherever. Over time we've learned that that was a bad idea. Specifically for Ceph S3 (if you're not familiar with S3, you organize data into what are called buckets), initially we had maybe five or six buckets that we just wrote everything to, really, really big ones. Recently we started having issues with our Ceph S3 server: access started getting really slow, the system became very unreliable, requests were timing out, processes were failing. We looked into it, and the first thing we saw was that the one problem-child bucket had something like 15 million objects in it. We started talking to Ceph engineering and Red Hat support, and we learned that there's a widely known rule of thumb, if you're in Ceph engineering or support, that a bucket should not have more than about one and a half million objects in it. So we were an order of magnitude over that recommendation, and that has been fun to deal with. I think it was back in March that we learned about this, and we have processes running right now, I could pull them up, that are moving data into different buckets, batching data, and combining objects, and it has been a pain.

So my recommendations here, if you're storing data in S3. One, we've learned that buckets are cheap: don't be afraid to create multiple buckets, and come up with a plan for distributing your data across them. That's recommendation one; it's really easy, create different buckets. In addition to that, or if you can't do that, come up with a plan for what you're going to store in the actual objects. One recommendation we have is that for data that is read frequently, maybe new data as it comes in (in the log example, today's logs or this week's logs), data you access or write frequently, store that in dedicated buckets, or store it as a bunch of smaller objects so you can access it really quickly. But then in the background, over time, at a regular interval, whatever you want to do, go through and reprocess those. That's what we're doing right now: taking all these really old millions of objects and combining them into a significantly smaller number of substantially larger objects. That way you reduce the total number of objects you have, operations like listing objects become very fast, and you don't really need to access that old data very often, so if there's a little more overhead to getting at it, that's an okay trade-off in my book. That's the biggest concern we've had around organizing data in S3.
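A sketch of that reprocessing idea in boto3: list a batch of small objects under a prefix, concatenate them into one substantially larger object, and delete the originals. This assumes the objects are simply concatenatable records, such as newline-delimited logs; the bucket, prefix, and endpoint are placeholders:

```python
# A hedged sketch of compacting many small S3 objects into one large
# object. Assumes records can be safely concatenated; all names are
# illustrative, not the data hub's actual layout.
import boto3

s3 = boto3.client("s3", endpoint_url="https://ceph-s3.example.com")

def compact(bucket, prefix, dest_key, batch_size=1000):
    """Combine up to batch_size objects under prefix into dest_key,
    then delete the originals (1000 is also delete_objects' limit)."""
    keys, chunks = [], []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            chunks.append(body.read())
            if len(keys) >= batch_size:
                break
        if len(keys) >= batch_size:
            break
    if not keys:
        return
    # One big object is far cheaper to list and track than thousands
    # of tiny ones; slightly slower reads are the accepted trade-off.
    s3.put_object(Bucket=bucket, Key=dest_key, Body=b"".join(chunks))
    s3.delete_objects(Bucket=bucket,
                      Delete={"Objects": [{"Key": k} for k in keys]})
```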
We've had a similar problem with Elasticsearch. We haven't fully solved it yet, but I think we know how to, and we've put a band-aid on the situation. Quick background on Elasticsearch: Elasticsearch stores data in indices, and indices are split into shards. Nodes in an Elasticsearch cluster have to keep track of all these shards in metadata, so the cluster knows where data is stored, and all of that metadata is stored in memory. There are limits to how many shards Elasticsearch recommends you store on an individual node before it starts performing badly; that's based on the amount of memory the Elasticsearch node has, and the JVM has a maximum amount of heap space it can keep track of. So there's a maximum number of shards you can have on an Elasticsearch node, and it ends up being somewhere in the neighborhood of 600.

Data in our system is stored in indices, so you want to limit the number of indices you have and limit their size; otherwise it takes a really long time to clear your data. A really easy pattern, in the log example, is to just create a new index every single day. A lot of people do that; it's really easy. The problem is, maybe you have one data set that generates gigs and gigs of data a day, and another data set that generates a few kilobytes a day. Both of those indices get rolled over every day, so over time what you have is a bunch of shards of wildly disparate sizes, and that's inefficient; it's not optimal. Remember that there's a limit to the number of shards you can have: you want to optimize the size of those shards, and you want to optimize their number and allocation. If you're wasting those precious shard slots on a bunch of really, really small shards, you're either going to have to massively scale up your cluster or cluster performance is going to suffer. So similar to how you have to plan for bucket and object placement in S3, you need to plan your sharding pattern in Elasticsearch.

I mentioned a lot of people just roll over the indices every day; that's really easy to do, it's the low-cost, lazy way. A better way, though, that we've discovered is using what's called the Elasticsearch rollover API. Again, this is getting super Elasticsearch-technical, but if you're using Elasticsearch you should explore this. Basically, you specify a size that you want your shards to be, and it does not create a new index unless your shards hit that threshold. It allows you, across the board, in a really easy, scalable way, to manage the size of your shards and optimize for the size of the Elasticsearch cluster you actually have. So: organize your data efficiently.
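A hedged sketch of that rollover pattern with the elasticsearch-py client: writes go through an alias, and a periodic call to the rollover API cuts a new index only once a size threshold is crossed. The alias name and 30 GB threshold are illustrative assumptions, not the data hub's actual settings:

```python
# A minimal sketch of the Elasticsearch rollover API: keep shard
# sizes consistent across big and small data sets by rolling over
# on size rather than on the calendar. Names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://elasticsearch.example.com:9200"])

# Bootstrap once: a concrete index behind a write alias. Producers
# always write to the alias, never to a dated index name.
es.indices.create(index="myteam-logs-000001",
                  body={"aliases": {"myteam-logs-write": {}}})

# Run this periodically (e.g. from a cron job). Elasticsearch only
# creates myteam-logs-000002 if the current index has grown past the
# threshold, so a kilobytes-a-day data set no longer burns a new
# set of shards every single day.
es.indices.rollover(alias="myteam-logs-write",
                    body={"conditions": {"max_size": "30gb"}})
```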
The final area I want to talk about is just running in containers. Containers are obviously ephemeral by nature, and sometimes it feels a little bit like fitting a square peg in a round hole; or the opposite of that, actually, since I think a square peg would fit in a round hole: a circular peg in a square hole. The fact that we run a data lake, which is all very persistent applications with persistent storage, everything about it persistent, is a little bit at odds with ephemeral containers. Three things we've learned can help you reconcile this.

The first is: use operators. Kubernetes operators are, in a nutshell, containers that manage an application, and typically these operators are developed by people who know the application. (I have three minutes left; I've got to go fast.) Leverage their work. Don't put all the time into figuring out what it takes to run Kafka in OpenShift when the Strimzi team has already done that work for you; they've taken all the guesswork out of it, and all this stuff is open source. Use operators; don't reinvent the wheel.

Two is: plan for redundancy. Kafka and Strimzi, and on a more generic level OpenShift, have a few different mechanisms you can leverage for fault tolerance, high availability, redundancy, that kind of thing. Three specific ones. First, with Kafka: partition replicas. That's not OpenShift specific, but replicate your data; you can do the same thing with Elasticsearch. Plan for failure, plan for things to be ephemeral, create copies. Second is pod anti-affinity: OpenShift makes it really easy to control which underlying nodes your pods run on. Leverage that; don't put all your applications in the same basket, put them on different nodes. Third, in Kafka you can use what's called node placement, which allows you to put your topics on different nodes, and you can do similar things with Elasticsearch to control which nodes your Elasticsearch shards run on. (I'm close, right? Let me wrap this up.)

And finally, the last thing I wanted to talk about is leveraging the available storage options. OpenShift gives you the ability to use host path mounts, so if you have something that requires really, really fast storage, you can use local NVMe storage and have it mounted locally in your OpenShift pods. Or you can use something like Ceph, which is container-native storage that works really well. Just match your needs to the kind of storage you have available to you. Again, we can talk more about that if needed. I don't actually know what time this ends, but that's the end of what I have. I can take questions if you have them. Do I have time for questions? Cool, I have time for questions.

For the Kafka normalizers: where do they run? Inside Kafka, when you write to the topic?

So right now we use something called Logstash. Over time we'll probably use Kafka connectors more, and maybe in some cases custom Python daemons, or whatever language, but the pattern is: data goes into Kafka in an ingestion topic, and on the back end there's a final-level topic that is the last stop before everything gets dumped into Elasticsearch or Ceph. In the middle, the normalizers pull data from the front-line ingestion topic, do whatever they need to, and write it to that final topic, where it gets dumped blindly into Elasticsearch or Ceph.

Thanks for your talk. Did you ever consider having the people that use the hub manage their own S3? Like, give them S3 credentials and it's their problem, not yours?

So we do give S3 credentials out to individual users, and I think you're also asking about connecting to their own S3 clusters. We can certainly do that; the tools we give them, like JupyterHub or Spark, are typically S3 agnostic, so if they have their data stored in AWS S3 they can just connect to it. The problem with making our actual S3 cluster kind of a wild west and letting people do whatever they want is that one bad actor, we've found, can bring down the whole thing. The issue we had was with that one big bucket with 15 million objects I mentioned: people were running queries against that bucket, their client would eventually time out, but in the back end, in Ceph, that query would keep running. The client would think, oh, my command failed, I'd better run it again, so in the back end you have all these really expensive queries running that never finish, and they bring down the whole cluster. I think that's something that will hopefully get fixed; this experience has let us work with the Ceph engineering team a lot to try to make this stuff better, and out of it at least two bugs have been fixed in Ceph. So hopefully this will get better, but right now everybody kind of has to play nice.

Cool. All right, well, thank you everyone.