Okay, we're back. I'm Dave Vellante with theCUBE and you're watching Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData. Anais Dotis-Georgiou is here. She's a developer advocate for InfluxData, and we're going to dig into the rationale and value contribution behind several open source technologies that InfluxDB is leveraging to increase the granularity of time series analysis and bring the world of data into real-time analytics. Anais, welcome to the program. Thanks for coming on.

Hi, thank you so much. It's a pleasure to be here.

Oh, you're very welcome. Okay, so IOx is being touted as this next-gen open source core for InfluxDB. And my understanding is that it leverages in-memory processing, of course, for speed. It's a columnar store, so it gives you compression efficiency. It's going to give you faster query speeds. It's going to store files in object storage, so you've got a very cost-effective approach. Are these the salient points on the platform? I know there are probably dozens of other features, but what are the high-level value points that people should understand?

Sure, that's a great question. Some of the main requirements that IOx is trying to achieve, and some of the most impressive ones to me: the first is that it aims to have no limits on cardinality, and to allow you to write any kind of event data that you want, whether that's a tag or a field. It also aims to deliver best-in-class performance on analytics queries, in addition to our already well-served metrics queries. We also want to give operators control over memory usage, so you should be able to define how much memory is used for buffering, caching, and query processing. Some other really important parts are the ability to do bulk data export and import, which is super useful, and broader ecosystem compatibility where possible. We aim to use and embrace emerging standards in the data analytics ecosystem and have compatibility with things like SQL, Python, and maybe even Pandas in the future.

Okay, so a lot there. Now, we talked to Brian about how you're using Rust, which is not a new programming language. And of course, we had some drama around Rust during the pandemic with the Mozilla layoffs, but the formation of the Rust Foundation really addressed any of those concerns. You've got big guns like Amazon and Google and Microsoft throwing their collective weight behind it. The adoption is really starting to get steep on the S curve. So lots of platforms, lots of adoption with Rust. But why Rust as an alternative to, say, C++, for example?

Sure, that's a great question. Rust was chosen because of its exceptional performance and reliability. While Rust is syntactically similar to C++, has similar performance, and also compiles to native code like C++, unlike C++ it has much better memory safety. Memory safety is protection against bugs or security vulnerabilities that lead to excessive memory usage or memory leaks, and Rust achieves this memory safety through its innovative type system. Additionally, it doesn't allow for dangling pointers, and dangling pointers are among the main classes of errors that lead to exploitable security vulnerabilities in languages like C++.
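To make that dangling-pointer point concrete, here is a minimal sketch in plain Rust (illustrative only, not InfluxDB code) of how ownership and borrowing rule out use-after-free at compile time rather than at runtime:

```rust
// Each value has exactly one owner; when the owner goes out of scope,
// the value is freed, and the compiler rejects any use that would
// outlive it.
fn max_reading(readings: Vec<f64>) -> f64 {
    // `readings` is owned here and freed when this function returns.
    readings.into_iter().fold(f64::MIN, f64::max)
}

fn main() {
    let temps = vec![21.5_f64, 21.5, 21.6];

    // Borrowing is fine while the owner is alive:
    let first = &temps[0];
    println!("first reading: {first}");

    // Passing `temps` by value moves ownership into the function...
    let max = max_reading(temps);
    println!("max reading: {max}");

    // ...so `temps` can no longer be used here. Uncommenting the next
    // line is a compile-time error, not a runtime use-after-free:
    // println!("{:?}", temps); // error[E0382]: borrow of moved value
}
```

The equivalent mistake in C++ (reading from a container after it has been moved or freed) compiles cleanly and fails, if at all, at runtime, which is exactly the class of vulnerability being described here.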
So Rust helps meet that requirement of having no limits on cardinality, for example, because we're also using the Rust implementation of Apache Arrow, and that gives us control over memory. Also, Rust's package registry, crates.io, offers everything that you need out of the box to have features like async/await, to fix race conditions, to protect against buffer overflows, and to ensure thread-safe async caching structures as well. So essentially, it has all the fine-grained control you need to take advantage of memory and all your resources as well as possible, so that you can handle those really, really high-cardinality use cases.

Yeah, and the more I learn about the new engine and the platform, IOx, et cetera, you see things like, in the old days, and not even that long ago, you'd do a lot of garbage collection in these systems, and there's an inverse impact relative to performance. So it looks like the community is modernizing the platform. But I want to talk about Apache Arrow for a moment. It's designed to address the constraints that are associated with analyzing large data sets. We know that, but please explain why. What is Arrow, and what does it bring to InfluxDB?

Sure, yeah. Arrow is a framework for defining in-memory columnar data, and so much of the efficiency and performance of IOx comes from taking advantage of columnar data structures. And I will, if you don't mind, take a moment to illustrate why columnar data structures are so valuable. Let's pretend that we are gathering field data about the temperature in our room and also maybe the temperature of our stove. In our table, we have those two temperature values, as well as maybe a measurement value, a timestamp value, and maybe some other tag values that describe what room and what house, et cetera, we're getting this data from. So you can picture this table where we have two rows with the two temperature values for both our room and the stove. Well, usually our room temperature is regulated, so those values don't change very often. When you have column-oriented storage, essentially you take each column and group its values together. If that's the case, and you're just taking temperature values from the room, and a lot of those temperature values are the same, then you might be able to imagine how equal values will neighbor each other. And when they neighbor each other in the storage format, this provides a really perfect opportunity for cheap compression. That cheap compression then enables high-cardinality use cases. It also enables faster scan rates. So if you wanted to find, say, the min and max value of the temperature in the room across a thousand different points, you only have to read those thousand points in order to answer that question, and you have them immediately available to you. But let's contrast this with a row-oriented storage solution instead, so that we can better understand the benefits of column-oriented storage. If you had row-oriented storage, you'd first have to look at every field, like the temperature in the room and the temperature of the stove. You'd have to go across every tag value that maybe describes where the room is located or what model the stove is, and every timestamp. You'd then have to pluck out that one temperature value that you want at that one timestamp, and do that for every single row. So you're scanning across a ton more data, and that's why row-oriented storage doesn't provide the same efficiency as columnar. And Apache Arrow is an in-memory columnar data framework, so that's where a lot of the advantages come from.
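As a rough illustration of that room/stove example, here is a minimal sketch using the Rust `arrow` crate (the column names and readings are made up, and the API can vary by crate version). Each column is one contiguous array, so the barely-changing room temperatures sit next to each other in memory, which is what makes them cheap to compress and fast to scan:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("room", DataType::Utf8, false),          // tag
        Field::new("room_temp", DataType::Float64, false),  // field
        Field::new("stove_temp", DataType::Float64, false), // field
    ]));

    // Column-oriented: all room temps together, all stove temps together.
    let room: ArrayRef = Arc::new(StringArray::from(vec!["kitchen"; 4]));
    let room_temp: ArrayRef =
        Arc::new(Float64Array::from(vec![21.0, 21.0, 21.0, 21.1])); // barely changes
    let stove_temp: ArrayRef =
        Arc::new(Float64Array::from(vec![20.0, 95.0, 180.0, 200.0]));

    let batch = RecordBatch::try_new(schema, vec![room, room_temp, stove_temp])?;

    // A min/max query over room_temp touches only that one contiguous
    // column, never the tags or the stove readings:
    let temps = batch
        .column(1)
        .as_any()
        .downcast_ref::<Float64Array>()
        .expect("room_temp is a Float64 column");
    let max = temps.iter().flatten().fold(f64::MIN, f64::max);
    println!("max room temp: {max}");
    Ok(())
}
```

In a row-oriented layout, answering the same question would mean walking every row and plucking out one value per row, which is the scan overhead described above.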
Okay, so you basically described, like, a traditional database, a row approach. But I've seen a lot of traditional databases say, okay, now we can handle columnar format, versus what you're talking about, which is really kind of native. Is the former not as effective because it's largely a bolt-on? Can you elucidate on that front?

Yeah, it's not as effective because you have more expensive compression, and because you can't scan across the values as quickly. And those are pretty much the main reasons why row-oriented storage isn't as efficient as column-oriented storage.

Got it. So let's talk about Arrow DataFusion. What is DataFusion? I know it's written in Rust, but what does it bring to the table here?

Sure, so it's an extensible query execution framework, and it uses Arrow as its in-memory format. The way that it helps InfluxDB IOx is that, okay, it's great if you can write an unlimited amount of cardinality into InfluxDB IOx, but if you don't have a query engine that can successfully query that data, then I don't know how much value it is for you. So DataFusion helps enable the querying and transformation of that data. It also has a Pandas API, so that you can take advantage of Pandas data frames as well, and all of the machine learning tools associated with Pandas.

Okay, you're also leveraging Parquet in the platform. We heard a lot about Parquet in the middle of the last decade because it was a storage format to improve on Hadoop column stores. What are you doing with Parquet, and why is it important?

Sure, so Parquet is a column-oriented, durable file format. It's important because it'll enable bulk import and bulk export. It has compatibility with Python and Pandas, so it supports a broader ecosystem. Parquet files also take very little disk space, and they're faster to scan because, again, they're column-oriented. In particular, I think Parquet files are something like 16 times cheaper than CSV files, just as a point of reference. And so that's essentially a lot of the benefits of Parquet.
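Tying those two pieces together, here is a minimal sketch of what querying Parquet data through DataFusion looks like from Rust. The table name, file path, and column names are hypothetical, and this is the upstream `datafusion` crate's API, which can shift between versions:

```rust
use datafusion::prelude::*;

#[tokio::main] // DataFusion's execution is async, so we need a runtime.
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a (hypothetical) Parquet file of temperature readings
    // as a SQL table named "readings".
    ctx.register_parquet("readings", "readings.parquet", ParquetReadOptions::default())
        .await?;

    // Standard SQL over columnar data; DataFusion plans and executes
    // the query, and results come back as Arrow record batches.
    let df = ctx
        .sql("SELECT room, MIN(room_temp) AS low, MAX(room_temp) AS high FROM readings GROUP BY room")
        .await?;

    df.show().await?;
    Ok(())
}
```

Because both DataFusion and the Parquet reader speak Arrow natively, the file's columns flow straight into the query engine without a row-by-row conversion step.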
Got it, very popular. So, Anais, what exactly is InfluxData focusing on as a committer to these projects? What is your focus? What's the value that you're bringing to the community?

Sure, so InfluxData has contributed a lot of different things to the Apache ecosystem. For example, they contributed an implementation of Apache Arrow in Go, and that will support querying with Flux. Also, there have been quite a few contributions to DataFusion for things like memory optimization and additional SQL features, like timestamp arithmetic, EXISTS clauses, and memory control. So, yeah, Influx has contributed a lot to the Apache ecosystem and continues to do so. And I think the idea here is that if you can improve these upstream projects, then the long-term strategy is that the more you contribute and build those up, the more you perpetuate that cycle of improvement, and the more we invest in our own project as well. So it's that kind of symbiotic relationship and appreciation of the open source community.

Yeah, got it. You've got that virtuous cycle going, what people call the flywheel. Give us your last thoughts, and kind of summarize what the big takeaways are from your perspective.

So I think the big takeaway is that InfluxData is doing a lot of really exciting things with InfluxDB IOx. And if you are interested in learning more about the technologies that Influx is leveraging to produce IOx, the challenges associated with it, and all of the hard work, or you just want to learn more, then I would encourage you to go to the monthly tech talks and community office hours. They're held every second Wednesday of the month at 8:30 a.m. Pacific time. There are also community forums and a community Slack channel. Look for the influxdb_iox channel specifically to learn more about how to join those office hours and those monthly tech talks, as well as to ask any questions you have about IOx, what to expect, and what you'd like to learn more about. As a developer advocate, I want to answer your questions. So if there's a particular technology or stack that you want to dive deeper into, and you want more explanation about how InfluxDB leverages it to build IOx, I will be really excited to produce content on that topic for you.

Yeah, that's awesome. You guys have a really rich community. Collaborate with your peers, solve problems, and they're super responsive. So really appreciate that. All right, thank you so much, Anais, for explaining all this open source stuff to the audience and why it's important to the future of data.

Thank you, I really appreciate it.

All right, you're very welcome. Okay, stay right there, and in a moment I'll be back with Tim Yocum. He's the director of engineering for InfluxData. We're going to talk about how you update a SaaS engine while the plane is flying at 30,000 feet. You don't want to miss this.