And here we go. Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of DataVersity. We'd like to thank you for joining this DataVersity webinar, Accelerating Queries on Cloud Data Lakes, sponsored today by Alluxio. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DataVersity. And if you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom middle of your screen for that feature. If you'd like to continue the conversation after the webinar, you can keep networking at community.dataversity.net. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and additional information requested throughout the webinar.

Now let me introduce our speaker for today, Alex Ma. Alex is the Director of Solutions Engineering at Alluxio and an open-source veteran. Prior to Alluxio, he worked for Couchbase, where he was the Director of Solutions Engineering and Principal Architect. And with that, I will give the floor to Alex to get today's webinar started.

Hello and welcome. Hey, Shannon, thank you for that intro. So, yeah, like we talked about, what we're going to go into today is how to accelerate queries on data lakes in the cloud. I think first off we'll start with some of the things we've noticed, talk about some of the challenges and trends we've seen in general production environments, and talk about some ways to make these things a little more addressable.

The first thing is to talk a little bit about cloud, right? Everyone's in the cloud; everyone wants to get into the cloud. And what we've noticed is that a lot of people are leveraging cloud in really interesting ways, which we ended up referring to internally as hybrid cloud. So what is a hybrid cloud? A hybrid cloud is really a scenario where you are leveraging cloud in any number of different capacities. You're leveraging it for compute, for CPU or GPU horsepower, for the big data applications that run in that environment and the accessibility they provide to your users. And a lot of times you're doing this against data that maybe you've lifted into the cloud, or maybe data that's still living in your existing data lake in your data center. There are a lot of reasons why these scenarios exist, and they're scenarios you wouldn't necessarily have run into five or ten years ago, just because the bandwidth wasn't there and the technology wasn't there. But today we start to see these things.

In terms of why someone might do this in the first place, there are a couple of different reasons. One is that they're looking for efficiency and time to production. They may talk to their data center team and say, hey, we're going to build out a new application, we need some resource, and they may be looking at a three to six month lead time to actually get that going: to provision it, to order it, all those typical things you might have to do within the data center.
Moving into a cloud model obviously shortens that drastically. You're talking about seconds, minutes, hours, maybe even days at the worst case to get additional capacity for these kinds of workloads. So being able to get to production faster than you could on-prem is one reason cloud is very attractive. Another reason is that it's not a fixed resource. With cloud, we can go up and down in the amount of compute resource we're using. I'll give you an example: we have a lot of customers who run analytical workloads at the end of the month. At the end of the month there is a very large demand for this data, and it is very time sensitive, so they need a lot of CPU horsepower for it. And honestly, this only happens toward the end of the month. In a traditional environment, you'd have to provision all of that beforehand and either let it sit idle for the rest of the month or find other workloads for it, just so you'd have the capacity when you needed it. In the cloud, it's a different equation. You can run the whole thing in a kind of idle configuration and simply expand that compute as needed. So it provides a lot of flexibility for these kinds of things.

I think the last reason people want to get into cloud, and why they're even running these hybrid cloud environments, is that it becomes an intermediary step before they finally get fully into the cloud. Customers we talk to are in one of three categories: they're already fully in the cloud leveraging those benefits; they're in the data center today and thinking about getting there; or they're running a mix. And the mix usually means they're in the migratory process of getting there. That process obviously takes more than a day; sometimes it takes several months, if not several years, to fully migrate all of those workloads. What ends up happening in the meantime is that they run this configuration we call the hybrid cloud.

So that sets the tone: why people are leveraging cloud and what exactly a hybrid cloud is. Let's look at some of the challenges and approaches to getting to a hybrid cloud. Part of this is: how am I going to actually run these workloads in the cloud? If I'm running Spark workloads, if I'm running Presto (to name specific technologies), if I'm running ETL batch conversions, ad hoc queries, analysis from my data scientists, machine learning workloads, how do I stitch all of that together while leveraging cloud resources? Part of that equation means making the data I have that's so useful accessible to these workloads in the cloud environment. There are a bunch of different ways to do this: we can copy the data by workload, we can do a full lift and shift, or we can leverage caching in the cloud. Each of these approaches has its own benefits as well as its own issues. We're going to be talking about these for traditional data lakes that might be running in a Hadoop environment, and the first thing you might look at is: how do I get the data up there?
You might leverage something like DistCp in the Hadoop world to get that data up into the cloud and accessible by whatever it is you're trying to leverage. But the challenge is that a lot of times it's hard to identify exactly what you want to lift up there. If you want to get a given workload up and running in the cloud in the next three months, you've got to identify exactly which part of the dataset that workload needs. And if it's missing some of the dataset, is it okay that the results are skewed, that they're off, or that the thing doesn't run because it's missing that data? You have to make that kind of determination, and if you can't, what you're left with is copying a large amount of data to be safe, just to make sure you can still run it.

If you're talking about lifting and shifting the entire workload, that's also challenging. If we're taking an application that has traditionally run in the data center for the last five years and saying, hey, we're just going to move this straight over to EC2 or GCP, that's not always the best way to go about it, because some things you have to think about in a different paradigm. One is optimizing for the cloud itself, and without doing that, it can be tricky and it can be expensive. You have to start thinking about things like: if I lift this entire thing, what about all the processes that I still have in the data center that still need access to this data? So migrating the entire thing up into the cloud becomes a little challenging.

The last approach we look at is leveraging a cache. Why don't I just use something like Redis? Why don't I use S3 to cache the data in the cloud, just so I can continue to leverage all that compute horsepower in the cloud provider's environment? This points back to the first challenge, copying data by workload: it's great if you can identify exactly what data you need, and it's great if the workloads are read-only. But it's problematic if you have data that needs to be synced back to the data center. You have to figure out how to do that and add additional application logic to get it accomplished. You have to figure out how to identify which pieces of data have changed, how to sync the deltas, how to keep up with what's going on in the production data lake in your data center. So there are lots of things to think about in terms of leveraging these cloud resources while also making data accessible to them.

What we're going to look at is a solution from Alluxio that we call Zero-Copy Burst, which allows you to leverage the cloud to scale these applications. Like we've been talking about, we have our data lake running in HDFS, and it's bound on some kind of resource. Maybe it's compute bound and out of vCPUs, maybe it's IO bound, or maybe the environment could just be faster and you want to leverage some other resource. So we've got our Direct Connect and we're set up against AWS, but some challenges start to arise. I'm leveraging a Gigabit Ethernet or 10 Gigabit Ethernet connection down to HDFS, and the latency and the bandwidth may not be sufficient for what I'm trying to do.
If I'm running an ad hoc query through Presto, and that query looks at three terabytes of data to answer a given question, doing that every single time over a 10-gig pipe very quickly becomes an unscalable solution. And, as we've just outlined, copying data to the cloud has its own challenges. It's difficult to maintain those copies. And once I get those copies into the cloud, if I'm in an industry where security and governance are important, I'm then charged with keeping this data secure when it's running in a data lake in the cloud. So there are challenges with that as well.

With Alluxio, what we're able to do is make this a workable solution, so that you can accelerate your queries in the cloud against your data lake. It allows access from the cloud to your data lake that lives on-prem, and it lets Alluxio handle the orchestration of pulling data into the cloud as it's requested and caching it. That removes the overhead of having to manually define datasets to copy and manually define processes to copy that data and keep it in sync, and it provides local performance in the cloud for these applications. Other benefits are that it scales in a cloud-native way. These big data applications you might be running, whether they're in EKS, in EMR, what have you: Alluxio is able to scale and work with how things flex up and down in a cloud environment. An additional benefit is that a lot of that workload is enabled without adding IO load or compute requirements to the existing data lake, because we've shifted all of this over to the cloud and can now leverage the cloud itself.

The last thing to point out about this solution, and where it gets interesting, is that it not only improves the performance of queries running against your on-prem data lake; it also lets you start migrating data and populating a data lake in the cloud, if that's your end goal. From what we've seen, that benefits organizations beyond just enabling these workloads in the first place. A lot of times these workloads are driven by hardcore analysts or hardcore data scientists, people who are used to writing Spark jobs in Scala or writing SQL directly against Presto. But there are a lot of business consumers who would love access to this data, and the tools don't exist in-house in the data center to provide easy access. So the second part is that Alluxio can be used not only to enable the workload, but also to help with migration of the data into a cloud data lake. And once it's in that data lake, you can start taking advantage of it by making it accessible to more of your consumers with the broad array of tools available in the cloud for a variety of different audiences. If you're trying to do machine learning, you could leverage things like SageMaker in AWS. You can train models. You can do all kinds of things that are accessible to a much wider audience than just the folks who are used to writing Python against TensorFlow or jobs directly against Spark and Presto.
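To make that read-through pattern concrete, here is a minimal sketch from the application side, assuming the Alluxio client library is already on Spark's classpath. The master hostname and the data paths are invented for illustration; 19998 is Alluxio's default master RPC port.

```python
from pyspark.sql import SparkSession

# Spark running in the cloud; the alluxio:// path is ultimately backed by
# the on-prem HDFS data lake that Alluxio has mounted.
spark = SparkSession.builder.appName("zero-copy-burst-sketch").getOrCreate()

# First access: Alluxio workers pull the blocks from on-prem HDFS over the
# WAN and cache them on cloud-local memory/SSD. Repeat accesses are served
# locally, with no traffic back to the data center.
events = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events")
events.groupBy("event_type").count().show()
```

The point of the sketch is that the job itself does not change when the backing store is remote; only the first, cold read pays the WAN cost.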
So that's an additional benefit of Alluxio. Let's take a look at a few examples of people doing this, and then we'll go into the nuts and bolts of how Alluxio actually enables these things.

The first one to look at is the Development Bank of Singapore, DBS. They're a large bank based out of Singapore, and they're leveraging Alluxio for a number of different things. They leverage it for a unified namespace: they have multiple different data stores, data silos if you will, multiple HDFS clusters, plus data stored in S3, and for some of their applications they simply need a way to make all of this data transparently accessible, which they do through Alluxio. They also leverage it in front of object stores. Object stores are a great way to store very large amounts of data in a scalable way, but a number of things become challenging when you start leveraging them for these kinds of workloads: metadata operations, renames and moves, listing files, listing directories. All of those are trickier against an object store than against a traditional file-based store like HDFS, and Alluxio helps dramatically with that. And lastly, they use Alluxio for the exact thing we've been talking about: being able to leverage AWS resources against data that lives in their data center to do model training, and being able to sync the results of the trained models back to their on-prem data center.

This is what it looks like. On the left they have their internal Singapore data center, and in there they run a lot of different big data frameworks: Spark on YARN, Presto, and Alluxio running against their multiple HDFS clusters. Connected to that through Direct Connect is another set of workloads, running Alluxio again, along with AWS EMR for both Spark and Presto, where they do analytics and machine learning. They're also leveraging Amazon SageMaker to help train these models. What they're able to do is take their 50 terabytes or so of historical data and voice data, use it to train models in the cloud on up-to-date information, and take that to better understand how the customer's journey is going so they can support them better when they call in. All of this is enabled through this combination with Alluxio; it makes things more efficient and makes the overall solution viable in the first place.

Walmart is another great example of this hybrid cloud, leveraging compute resource in the cloud. Now, this is a different cloud, Google Cloud Platform, but it's the same kind of thing, right? In this case, they've got multiple large HDFS clusters that live in the various Walmart data centers, and they've decided that they need to offer querying as a service for a lot of their consumers. To hit that scale, what they've decided to do is leverage GCP. This supports around 3,200 users across 40 business groups for ad hoc querying and analytics. And the only reason it scales is that they're able to leverage Google Cloud Platform and leverage Alluxio to pull in the necessary data, cache it in the cloud, and improve query performance without having to go directly to the HDFS data lake every single time.
Additionally, they're able to leverage that capability of Alluxio to have the queries dictate what data is hot, and use Alluxio to start populating the data lake in Google Cloud Storage. And once that data is accessible there, they can look at it with other tools as well. So not only do they query it with Presto, but it's also available through other Google technologies like BigQuery, so that, again, it's available to a wider audience because the data is accessible.

The last use case is a customer we've worked with that ran out of room in their data center. This is a hybrid cloud configuration, but on-prem, and it's hybrid in the sense that the challenge is exactly the same even though it doesn't involve a cloud provider. They have a large 30-petabyte Hadoop cluster running in one data center where they're physically maxed out on capacity and oversubscribed on CPU, and their analysts and business users have a need for ad hoc querying. What they were able to rig up, essentially, is to leverage capacity at a new site to deploy a large number of machines running Presto and Alluxio. Presto and Alluxio cache the data from their production Hadoop warehouse, or data lake, and use that to provide query performance as if the two were running within the same site. What that allowed them to do is avoid heavy investment in the network, manage this new workload, and not have to figure out an alternative solution involving construction in the existing data center. What we're looking at here are some of the query times: Presto plus Alluxio compared to Spark SQL. This graph is actually missing something: the original numbers from when this was running within the same data center. But the performance they are getting is essentially 3x what their prior solution delivered. So enabling these new solutions, where compute and data are completely remote from each other, is something Alluxio can be very helpful with.

We've spent a lot of time talking about the challenges, the desire to move to the cloud, what Alluxio is at a high level, and some of the customers actually using it. Let's dig in a little on the technology and give you a sense for how Alluxio does any of these things we've been talking about over the last 25 minutes or so. With Alluxio, I would say there are a couple of key innovations that, when you combine them, give you this capability to run data orchestration for a hybrid cloud: data locality, data accessibility, and data elasticity. We'll dig into what each of these means and give you some additional context.

First, data locality with intelligent multi-tiering. What that really means is that the Alluxio worker process is typically installed co-located with whatever big data framework you're trying to enhance. It could be Apache Spark, it could be Presto, it could be TensorFlow, but typically there will be unused resource on that machine that we can leverage to make that framework run better. Alluxio is very configurable: you can say, hey, these Spark executors are configured with this amount of memory; we'd like to allocate 10 gigs from each of these nodes to Alluxio, and we'll allocate it as RAM.
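As a rough sketch of what that allocation looks like in configuration, covering both the RAM tier just described and the SSD tier discussed next: the property names follow Alluxio's tiered-storage documentation, though exact names vary by version, and the mount paths and quotas here are invented.

```python
# Sketch: append a two-tier cache layout (RAM + SSD) to each worker's
# alluxio-site.properties. Property names per Alluxio's tiered-storage docs;
# the ramdisk/SSD mount paths and the quotas are placeholders.
tiering = """
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=10GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd1,/mnt/ssd2
alluxio.worker.tieredstore.level1.dirs.quota=1TB,1TB
"""

with open("/opt/alluxio/conf/alluxio-site.properties", "a") as f:
    f.write(tiering)
```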
Whereas with something like Presto, which can be very memory hungry, you might say: Presto is going to use all the memory for its JVM heap for the queries it's doing, but there are a bunch of SSDs on these machines, so we can configure Alluxio to leverage those as a tier for storing and caching data local to that Presto worker. It's very configurable. You can tell it: I've got memory, I've got SSD, or I've got spinning disk, and I want to allocate some of it to Alluxio. These frameworks will essentially request data, and Alluxio will reach over to the data lake where that data lives and pull it down locally. The first time it does that, it's going to be exactly the same as if you didn't have Alluxio installed. But from then on, every single interaction with that piece of data is going to be very quick, because it's no longer reaching out to the remote data lake; it's talking to the local machine, not doing any network IO, and able to serve it from local memory, SSD, or HDD.

One thing that's really interesting is that Intel is a strong partner of Alluxio, and they match up perfectly with this kind of architecture. They've come out with a couple of new technologies, and the one we're going to look at today is called Optane Persistent Memory. It's a new class of memory and storage technology, and what it gives you is essentially a layer that, the way I think of it, sits between direct DRAM on the machine and something like an NVMe drive. It gives you another option to store very large amounts of data in memory at a lower price than just filling up the machine with RAM. Here are some specs for what the Optane DC Persistent Memory modules are available as. Once you tie that in, you're able to configure Alluxio with multiple different tiers. For our customers who are storing, say, 300 terabytes of data, it may not be feasible to store all of that in direct DRAM, but Intel Optane Persistent Memory gives them an option to hold a good chunk of it more economically than DRAM, with better performance than an NVMe SSD. So it's a really interesting option from our partners at Intel, and one more thing to consider in terms of what you can do with Alluxio in these kinds of environments.

The second innovation, coming back off that tangent, is accessibility of data. It's one thing to have a solution that's Spark-specific or specific to a given framework. What Alluxio does is act as a middleman, if you will, between a very large number of big data frameworks and a very large number of places where you might store that data, and it handles the accessibility and the translation between those two constituents. We sit in the middle and offer interfaces to all of the big data frameworks so they can talk to Alluxio in the way that is most efficient for them, and we have drivers on the southbound side so that we can connect to any of these different storage layers.
So with Alluxio, it may not be easy to configure Spark to talk to S3 as well as HDFS as well as a MinIO object store, but it's very easy to configure it to talk to Alluxio. Alluxio can bridge the translation between all of those and give you a single place to configure access to all of these different systems.

The unified namespace is another part of this. Part of running these workloads is having accessible data. Instead of saying this chunk of data is stored on this data lake and this other piece is stored on S3, what you're able to do is have Alluxio mount all of those different storage layers and present a single unified interface, so that your application sees data very much like the view you have here: you have a path, and along that path you have access to different pieces of data. From an application standpoint, you don't need to care that this one needs an access key, or that this one is on Google Cloud Storage and needs an interoperability token, or that this one is on HDFS and your application has to be configured with these keytabs and these principals. All of that is abstracted away, and the application developers can focus on the business logic they're trying to implement, not the infrastructure concerns of where this data is and how to get access to it.

Put another way, here's another view of it: Alluxio is accessed through a URI, very similar to HDFS or S3, and data is located along a path. What you can see here is that we actually have two things going on. At the root mount point we've mounted an HDFS data lake, and within that file system hierarchy there is a users folder with Alice and Bob inside it. We also have a nested mount point that points to an S3 bucket; that's mounted as /data within Alluxio, and it holds reports and sales. For an application accessing this data, again, it doesn't really need to care where the data itself is located; it can simply access data along this alluxio:// scheme to fetch what it needs.

When it comes time to start populating that cloud data lake, we're able to define policies. We can say: I want to mount my on-prem HDFS data lake at /data, but I also want that mounted against S3. You configure Alluxio to mount both of these storage layers at the same location, and then define a policy that says: as data is read or written through this mount point, after X amount of time, copy or move that data to the second storage layer. As an example, if I run a Spark job against a given folder in HDFS, after three days that data might automatically get copied to S3, or automatically move to S3, depending on the Alluxio policy I've defined. Again, what that allows us to do is avoid pre-identifying exactly which datasets are commonly accessed. It lets us get this data into the cloud so other applications can leverage it, without the manual operational overhead of copying it, keeping it in sync, and all those different concerns.
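Here's a hedged sketch of the nested-mount setup just described. The shape of the `alluxio fs mount` command and its S3 credential options follow Alluxio's documentation, while the bucket name, credentials, hostnames, and paths are placeholders; the root mount to the HDFS data lake is normally set at deployment time, and the syntax for the time-based copy/move policies is omitted here rather than guessed at.

```python
import subprocess
from pyspark.sql import SparkSession

# Nest an S3 bucket under /data in a namespace whose root is the on-prem
# HDFS data lake. The credentials and bucket name are placeholders.
subprocess.run(["alluxio", "fs", "mount",
                "--option", "aws.accessKeyId=<ACCESS_KEY_ID>",
                "--option", "aws.secretKey=<SECRET_KEY>",
                "/data", "s3://example-reports-bucket/"], check=True)

# Applications now see one tree behind one scheme: /users/... resolves to
# HDFS and /data/... resolves to S3, with no per-store credentials in the job.
spark = SparkSession.builder.getOrCreate()
alice = spark.read.text("alluxio://alluxio-master:19998/users/alice")
sales = spark.read.parquet("alluxio://alluxio-master:19998/data/sales")
```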
Okay, so in terms of how Alluxio works, it's actually very similar to HDFS, where you have a NameNode and DataNodes. In this example, what we've got is a typical reference architecture for Alluxio: a very large number of machines running Presto, a very large number of machines running Spark, and Alluxio workers co-located on all of those. Those Alluxio workers connect to the storage layers, and they leverage some of the leftover capacity on the machine, whether from a memory standpoint, an SSD standpoint, or a spinning-disk standpoint, whatever is available, to cache some of this data locally and avoid round trips when these applications want data. The way it works is that either of these applications talks to an Alluxio client to make a request: hey, I'm looking for this piece of data. The Alluxio client goes to the Alluxio master and checks whether that data exists on any of the workers that make up the local cluster. If it does, great: it serves it, and from the closest worker. If it doesn't, the master designates a worker to be responsible for that chunk of data, and that worker reaches out to wherever the data lives, whether HDFS or the object store, fetches it, and caches it locally. So it's a very simple architecture, and very elastic. It's very easy to add capacity, and if a worker goes missing, the Alluxio master is able to say: I no longer have access to that piece of data; we're going to have to fetch it from HDFS and store it on a new worker. That makes it very efficient to run in cloud architectures.

In terms of interacting with Alluxio, it's also very simple. Depending on the kind of application you're working with, you're going to access it exactly the way you might access HDFS or S3 today, with a URI. In something like Presto there are a few very easy ways to work with Alluxio; we've made it much, much more seamless in the last year, to the point that you can be leveraging Alluxio with your normal queries and not even know it. So there are a lot of different ways to interact with Alluxio, and these are just some visuals of what it looks like when you're working with these applications.

Okay, so we've spent some time talking about the challenges of accelerating queries against cloud data lakes, some time on hybrid cloud environments, and some time on Alluxio. Let's open it up now for questions: anything you're curious about, anything you might want to dig into. So, Shannon, do we have anything interesting?

We do. And just to answer the most commonly asked question: a reminder that I will send a follow-up email by end of day Thursday with links to the slides and to the recording of this presentation. So, diving in here, Alex: I assume the first user gets pretty bad performance. Do you have staged queries that you run to warm up the cache?

Yeah, so that's a great question. I wouldn't necessarily characterize it as pretty bad performance, and I guess "pretty bad" depends on how you're defining things, but the first user making a request for that data is going to see the exact same performance they would see if they didn't have Alluxio in the environment, right?
And if you're doing a query that looks at three terabytes of data over a one-gig link, yeah, I would definitely define that as terrible, so you're not wrong there. But to answer your question: there are a couple of different ways. Some users will pre-stage the queries. They'll know that a given user is going to run a certain kind of operation against a certain partition of the table, so they'll do a SELECT * with the predicates set so that it hits the most common things users are going to request. That's one way to do it from the application standpoint: the application or the user stages that data beforehand. But what you can also do in Alluxio is load that data preemptively. You can run a command that says: I want to load all the data along this path. So in one case you're doing it application-centric, having the application translate what data is requested into Alluxio, and in the other you're telling Alluxio directly: data on this path is hot, and I want you to load it. You can do it either way, but the end result is that the first user running a query has a much better experience, because that data is already cached on a local worker in the Alluxio cluster.

And what tools is Alluxio built on?

So Alluxio is not so much built on other tools. There are different components that make up the framework itself; we leverage things like gRPC and Netty, since we're not going to build our own communication protocols in Java from scratch. But to answer your question more broadly: Alluxio is an open-source technology built from the ground up. It started as a PhD project out of the Berkeley AMPLab, and originally it was built to be the off-heap persistence layer for Apache Spark. It started off as a project probably five years ago and has grown dramatically since then, both in the number of people contributing to it on GitHub and in what the product can do from a functionality perspective. So now it's not dedicated only to Apache Spark but ties in with all these other technologies. But to answer your question, Alluxio is an open-source technology built on top of Java. Hopefully that helps.

Indeed. Does the cache get cleared on a daily or monthly basis and repopulated starting with the new query day or month?

Yeah, that's also a great question. It's configurable. You can set a time to live for data in the cache, and you can do that along specific paths. You might say: data accessed over here is kind of archival, remove it from the cache after 24 hours; data in the cache over here is more medium-term, remove it after 30 days. You also have the option of pinning data so that it's always in a given tier in Alluxio. So you have a lot of configuration options, and I would say: run the system for a little while, see what the usage patterns look like, see what data is hot. There are a number of ways to tune this, between eviction algorithms, pinning strategies, and time-to-live strategies.
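Pulling those two answers together, here's a rough sketch of both warm-up approaches plus the TTL and pinning controls. The CLI verbs (`load`, `setTtl`, `pin`) come from Alluxio's filesystem CLI, and the Presto client is the presto-python-client package; the table, paths, hostnames, and TTL values are invented.

```python
import subprocess
import prestodb  # pip install presto-python-client

# 1) Application-side warm-up: run the common query shape ahead of users so
#    the partition it touches lands in the Alluxio cache.
conn = prestodb.dbapi.connect(host="presto-coordinator", port=8080,
                              user="warmup", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT * FROM sales WHERE month = '2020-03'")
cur.fetchall()  # draining the results pulls the data through the cache

# 2) Telling Alluxio directly that a path is hot.
subprocess.run(["alluxio", "fs", "load", "/data/sales/month=2020-03"],
               check=True)

# Cache lifecycle controls: expire archival data after 24 hours, and pin a
# hot dataset so it is not evicted.
subprocess.run(["alluxio", "fs", "setTtl", "/archive",
                str(24 * 60 * 60 * 1000)], check=True)  # TTL in milliseconds
subprocess.run(["alluxio", "fs", "pin", "/data/hot"], check=True)
```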
And what are the maximums for an Alluxio cluster?

So you can look at maximums in a number of ways: data density, cluster size, throughput. We'll try to give you some maximums for each of those. We have some clusters running at large Chinese companies at somewhere around 1,600 nodes in a cluster. That's obviously a very large Spark cluster with Alluxio co-located on it, and from a node-count standpoint that's the largest I know of off the top of my head. From a data density standpoint, I forget the exact numbers, but we have some clusters provisioned with capacity for multiple single-digit petabytes. Those clusters actually have multiple tiers, a memory tier and a disk tier, and it depends on what you're looking to do, but essentially you're building up a cluster of many machines and leveraging the resources of those machines. If you've got multiple terabytes of disk on each machine and a few hundred gigabytes of RAM, you have the capacity for a lot of potential space. And lastly, from a throughput standpoint, a lot of it is driven by concurrency, so the number of threads requesting data is really what drives the throughput. But we have a few users that are what I would call interface saturated: they're able to drive enough traffic from the client side that they occasionally hit interface saturation on the machine Alluxio is running on. What they've done in those cases is bond multiple network interfaces together to get additional network capacity. Obviously they're leveraging the memory tier for Alluxio, or an NVMe or Intel Optane persistent memory tier, to get that kind of throughput on a machine where they're saturating the network interface. But along those three dimensions, that should give you a sense of the maximums Alluxio is capable of, or at least has been seen doing in production environments.

And Alex, what third-party databases and other products are supported by Alluxio?

Could you repeat that one more time?

What third-party databases and other products are supported by Alluxio?

Yeah, so that's a good question, and I think it's one of the areas where people tend to get confused about Alluxio itself. Alluxio is not a database per se, or a database management system. It connects a lot of big data frameworks with a lot of storage layers, and the thing to understand is that if you look at this bottom layer, everything in here is essentially file based. We're looking at multiple different kinds of cloud object stores, things like HDFS, things like ADLS, and a couple of others, but everything is file oriented. So the database functionality we tie into is usually something leveraging Spark SQL, or something like Presto or Hive, and they're running against ORC files or Parquet files, those types of things, not against a traditional database system like an Oracle or MySQL. That's one really important thing to understand about Alluxio: it's focused on files, which in the big data world maps to the file storage formats for data, but it's not focused on databases.

How is unstructured data quality streaming handled, including black data?

I'm going to need a little more context there. If the user that asked that question could add a little more information, I could probably respond. Maybe we can move on to the next one until then?

Yeah, go ahead.
Yes. So while we wait on that, let me jump to the next question: how is security handled between data storage layers and file systems?

Also a good question. Again, everything with Alluxio is fairly flexible, so the basic answer is that it can either define its own policies or tie into whatever you're using to define your authorization and authentication policies. Commonly we get a lot of requests for things like Apache Ranger, LDAP, and Kerberos, and we tie in with all those mechanisms for establishing that this user is who they say they are and that this user is allowed to do this kind of operation. And if you don't have one of those systems to tie into, you can use Alluxio as is: without any of those things, Alluxio falls back to a basic POSIX framework for file system security and authorization. It can leverage the OS users and groups and define permissions with the UNIX permission set, read-write-execute for user, group, and world. So it can tie into either mechanism. When you're just testing Alluxio, it makes a lot of sense to use the default, but as you move into production it may make more sense to tie into Ranger, or into Active Directory or LDAP.

Perfect. And can you do machine learning with Alluxio? What I mean is, can it select what type of data to store and process?

I'm not 100% sure what's meant by selecting the data to process, but we do have a very large number of users doing machine learning in a variety of different ways. They're using technologies like Apache Spark, things like TensorFlow, things like SageMaker, and things like Presto, which is kind of surprising to me because I don't think of it as a machine learning technology. But to answer the question: they're using all of these frameworks to do machine learning, to train models, those sorts of things, and all of these frameworks are just accessing data along a given path. The developer, or whoever is managing that application, tells it: here is where the training dataset is located, or here is where I want you to output this dataset. That location happens to be an Alluxio path. So Alluxio is not doing anything specific to decide what data to serve; it's more that the person running those frameworks is saying, this is my training dataset, this is my verification dataset, those sorts of things.

Great, thanks, Alex. And going back to the previous question where you asked for more information, the original question was: how is unstructured data quality streaming handled, including black data? And to expand on that: during unstructured data insertion into a data lake, how is the real-time streaming of data handled, thereby also handling the unused black or log data, considering we're dealing with telecom network data?

Okay, gotcha. So it's an interesting question: how do I get data into the warehouse, or into the lake, if it's coming from logs, from JSON files, from CSV? Alluxio itself is not really going to do anything super special for that. You're still typically going to need a framework to help transform that data as it comes in. You might have something like Apache Spark doing ETL against the incoming data, doing a little bit of cleansing on it before it goes into the lake, or you might be using Spark or you might be using Kafka.
In those cases, what can be helpful is that Alluxio can act as a kind of temporary memory buffer for writes. A lot of times, as data is getting cleansed, as it goes from unstructured to more and more structured before it gets into the lake, there are going to be multiple operations against that data. Alluxio can act as a temporary memory buffer where you're able to do those writes, handle a very large volume of data, and use it as a temporary storage location as the data passes from phase to phase of that cleansing process. We've seen users where the cleansing process starts with something like 300 terabytes of just raw data: an HPC cluster processes it first, then it moves through multiple different Spark jobs, and somewhere along the end of those multiple operations the data is ready for, say, a TensorFlow machine learning workload. So Alluxio is not specifically tied to that pipeline, but it can be used to help with those kinds of things.

I love it, and I love all these great questions coming in. Just back to machine learning, Alex: how many machine learning techniques can Alluxio take?

The question was how many machine learning techniques?

Correct.

Yeah, again, it's really simple: the number of machine learning techniques is going to be limited only by the framework that you're using. The things I see commonly for machine learning are PyTorch, TensorFlow, Spark ML, and probably a couple of others I'm blanking on, but the techniques available are going to be whatever is available in those frameworks. The thing about machine learning that's interesting is that a lot of it is based on training, right? The better the data you have, and the more of it you have, the better, more accurate, and faster a model you come out with. What we've seen is that a lot of times it's not like you run the thing once. In enterprise environments, these things go through a thousand iterations of the training run before you actually get to an efficient model, and at each iteration you're tweaking various parameters, looking for what aligns best with the data you have. As part of that, you're leveraging Alluxio to keep that data in memory and close to the GPU, or close to whatever you're doing the training with, so that it can train much more efficiently. So to answer the question: you're going to have available whatever techniques are available in the given framework you're using, but the number of models you're able to generate is going to be substantially higher, because you're getting better utilization out of that training framework.
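As a sketch of that access pattern, assuming Alluxio's POSIX (FUSE) interface is mounted at /mnt/alluxio so the training framework can read the dataset like local files; the paths, file format, and the training hook are hypothetical.

```python
import glob

# Hypothetical iterative training over a dataset exposed through an
# alluxio-fuse mount. The first epoch's reads fault data in from the
# underlying data lake; later epochs are served from Alluxio's local tiers.
EPOCHS = 5
files = sorted(glob.glob("/mnt/alluxio/training/images/*.tfrecord"))

for epoch in range(EPOCHS):
    for path in files:
        with open(path, "rb") as f:
            record_bytes = f.read()   # warm reads hit local memory/SSD
        # train_step(record_bytes)    # hypothetical hook into the framework
```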
We have one customer that, after starting to use Alluxio with this process they go through of cleansing, filtering, and then training against data, is able to generate four times as many models per year as they could without it. And that speaks to the point: we're doing all these advanced things, but we also need a platform where something very read-heavy like this is made a lot more efficient. So hopefully that helps.

What are some of the hardware requirements for Alluxio worker nodes?

I think we have the minimums up on our website, and if I had to guess it would be something like four vCPUs, eight gigabytes of RAM, and a network connection. We have Alluxio run across a very wide span of hardware. In our own testing environments we run it in containers with a slightly lower spec than that, because we're doing large cluster simulation and multi-node testing. We have some customers that have it deployed on machines with, I think, just under 100 CPU cores, probably close to a terabyte of RAM, and multiple petabytes of NVMe. So it goes all over the place in terms of machine spec. What I would really ask is: what framework are you using, and what resource is it bound on? If you're using Spark, it may want memory and CPU, and you might say, hey, I've got a little bit of memory left over on this machine that I can allocate to Alluxio, or I've got some SSD lying around. The Alluxio workers themselves are not going to be terribly compute heavy, which is great, because all these big data frameworks are. So really, what Alluxio wants as a minimum requirement is whatever you have available on the machine that's left over. And in some cases, for a given workload and a given requirement, it may make sense to say: we're going to bump up the physical capacity of this machine, because we have defined multiple terabytes of hot data and we have this many nodes, and we're going to use Alluxio to store it, so let's do the math on what makes sense. But the minimum is pretty small and the maximum is fairly large; look at what's in your environment and what's left over, start there, and see if it makes sense to increase it.

I love it, great answer. So, do you distinguish between structured and unstructured data?

Well, Alluxio itself doesn't. It is going to serve as a bridging layer to provide more seamless access to data, to cache that data locally, and possibly to help you migrate and start populating that data lake in the cloud. Alluxio is, to be quite honest, very agnostic about data. The application, which might be Spark or might be Presto, is going to request a given piece of data, a file, and Alluxio doesn't really know much about it beyond that. Alluxio is bridging the connection to where that file is located, where that data is stored, and it is optimizing that by reducing the overhead of figuring out where it's located and how to connect to it, and by making access to that file more efficient by caching it locally for the application that's requesting it. So no, to be honest with you; whether it's structured or semi-structured data is more about whether your application is able to work with it.

All right, well, I think we have time to slip in one more question here, Alex.
So, the tough question of the day. You've given us so many good reasons to use Alluxio and shown what a great product it is; what problems have you noticed using it, and what are the contraindications for usage? Where is it not appropriate?

Okay, so let's see here. The question is where is it not appropriate, and what was the first part of that question?

What challenges have you noticed?

Okay, gotcha. Let's see, that is a tough question. So let's start with the easier part, and give me a second to think about the other: where is it not appropriate? I think we touched on this a little earlier, but a lot of users will think of it as a way to accelerate queries against remote data lakes, and some people don't make the distinction between a SQL-based system, something like a Snowflake or an Aurora or an Oracle cluster, and something like a Hadoop data lake. Those are obviously very different things, and you have some technologies, like Presto, that don't really care; they can ask questions of either. But Alluxio is definitely focused on the file aspect of this. So where I'd say it's not really appropriate: if you're leveraging any kind of ODBC or JDBC connectivity to get to the data itself, Alluxio may not be a fit for you. It's going to be focused on connecting your application to an HDFS data lake, a data lake that lives on an object store, or a data lake that maybe lives off-prem. So I'd say that's the first thing: remember that, and save yourself some time trying to integrate it with an Oracle.

For the next question, what challenges have we seen with it? Anything in the big data world can be challenging just because there are so many moving parts. Sometimes you add a component into this and you don't quite know what to look at to troubleshoot what's going on. So what I will say is: look at the default settings of Alluxio and understand what it is you're configuring. The default options for Alluxio are things like writing only to memory, as opposed to writing through to the storage layer. The default options allow you to have multiple copies of the data, which is going to take up additional room in memory. And there's no default option for where you're storing data, but obviously understand that if you give Alluxio a place to store data and that place is a spinning-disk EBS volume, it is not going to be as performant as giving it memory. So I would say where it causes problems is maybe with some of the default options, adding it into a big data environment without quite knowing what area to look at, and not having the correct expectations for what it's going to do. I don't think these challenges are specific to something like Alluxio: you add anything into an environment with a few dozen machines and it adds a layer of complexity. To soften that, you have to really understand the technology, or at least have a basic understanding of it, and have the correct expectations of what it's doing. So I'd say the problems we have are maybe where some of those things are not aligned, and that's probably the best I can do without additional detail.
Alex, I love it. This has been so good, and what a great presentation. Thanks so much for this, and thanks to all of our attendees for being so engaged. We love all the great questions. Just a reminder: I will send a follow-up email by end of day Thursday with links to the slides and to the recording of this session. Again, thank you so much for attending, and Alex, thank you so much for this great presentation. Hope everybody stays safe out there.

Thanks for the time, everyone. Thanks, all.