Hi, we're here at the IBM Storage Summit, and we're digging into the requirements and challenges that modern lakehouses put on storage systems, and how diverse data types can be handled virtually seamlessly. Vincent Hsu is back on theCUBE. He's an IBM Fellow, a VP, and the CTO of IBM Storage. We're going to talk about Storage Fusion, how it accelerates IBM watsonx.data queries, and we're going to get into some of the trends in data. Vincent, it's good to see you again. Thanks for coming on.

Very nice to see you.

So you guys have some pretty interesting claims here, but first, let's start with the watsonx technology. What are the key characteristics, the salient points that we should know about?

Let me focus on watsonx.data, the data management piece of it. There are three salient points to watsonx.data. Number one is support for multiple engines. As you can see, in order to accomplish complex tasks you often need multiple databases and query engines to work together, so watsonx.data supports multiple engines. The second is that IBM watsonx.data supports open data formats. We champion open data formats to allow different kinds of platforms to share data. The third is built-in data management and data governance, to allow us to truly harvest the insight in the data and truly manage this very diverse data.

And by open data formats, you're talking about Iceberg tables, correct?

That's right, Iceberg, yes.

Okay, and based on what you just said, how does that affect or change the requirements from the storage system perspective?

Very good question. Think about what watsonx.data is trying to accomplish: a platform that supports multiple engines and allows customers to query data anywhere, of any type, at any time, with consistent performance. That puts a new challenge on the storage layer. So what does the storage need to do to accomplish that? First of all, as you know, the data can reside in HDFS, in file storage, or in object storage; it can be on-prem, at the edge, or in the cloud. So what we do is provide a storage virtualization layer that presents consistent storage to the query engines, while in the back end we virtualize very different types of storage: HDFS, files, and objects. That's number one. The next one is acceleration. After we virtualize them, we can cache the right data from whatever storage it lives in, maybe thousands of miles away in some object store, onto local NVMe drives to support the real-time requirements of the query engine. That's number two. And number three is the multi-engine part. Once you support multiple engines, not all engines use the same storage interface. Some use an object interface, some use a path interface, and some may even use an HDFS interface. So we provide a storage layer that lets all the different engines access the same data and share it, which improves efficiency and improves data sharing between the different applications.
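To make the virtualization idea more concrete, here is a minimal sketch of what a single read interface over HDFS, file, and object back ends could look like. It only illustrates the concept, not the IBM Storage Fusion API; fsspec is used here purely as a stand-in, and the URIs and bucket names are invented for the example.

```python
# Illustration only -- not the IBM Storage Fusion API. fsspec stands in for the idea
# of one consistent interface in front of HDFS, POSIX file, and S3 object storage.
import fsspec

# Hypothetical locations of the same logical table in three different back ends.
EXAMPLE_URIS = [
    "file:///tmp/warehouse/orders/part-0.parquet",
    "s3://lakehouse-bucket/warehouse/orders/part-0.parquet",
    "hdfs://namenode:8020/warehouse/orders/part-0.parquet",
]

def read_bytes(uri: str) -> bytes:
    """Engine-facing call: the same shape no matter where the bytes live."""
    with fsspec.open(uri, "rb") as f:   # fsspec resolves the back end from the URI scheme
        return f.read()

if __name__ == "__main__":
    # The query engine issues the same call for every URI; the virtualization layer
    # (here, fsspec) worries about which back end actually holds the data.
    for uri in EXAMPLE_URIS:
        print("engine would call read_bytes on:", uri)
```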
So when I think of different storage types, different workloads also jump into my head, maybe OLAP, streaming, graph data, transactional data, and then different query options. Are you saying that you can take these different storage types and different query options and translate them into things that a machine can understand in a consistent way, so that I can deal with these different data types and storage formats?

Correct. The query engines don't have to deal with all of that diverse storage anymore, whether it's on-prem, in the cloud, or at the edge. It goes through this IBM Storage Fusion technology: we virtualize it, we accelerate the performance, and we provide the sharing capability across multiple engines.

Now, you're also talking about some significant improvements in query performance. Can you quantify that and tell us how you achieve it?

Yeah, so the experiment is like this. First, let me back up. We would all like to have all the data close to the engine if we could, but that's not reality, right? Sometimes, for various reasons, you cannot move the data easily. So in order to query all the data from any place, of any type, anywhere, there is often a lot of data that is remote to your query engine. The storage technology is able to detect what the right data is and cache it from the remote data sources onto local NVMe drives to provide very high performance. In the lab we have seen a seven to 10x performance improvement from this caching alone. On top of that, we are able to change the data from whatever its original format was into a more efficient format. Sometimes the data in an external object store is just in a text format; once it's converted to Parquet, or even Iceberg, the performance is even more stunning. With the right data format, the query engine can see another 10x performance improvement.

Interesting. So you're saying you actually get better performance in an open table format?

Yep.

What leads to that? What's the magic behind that, Vincent?

The magic really is that text formats are not friendly to certain query operations. We can build metadata into the open data format that allows us to accelerate those query operations.

Can we talk about governance a little bit? Governance and data sharing is obviously a hot topic. How are you approaching that? What kind of promises are you making to customers in the context of governed data sharing?

We often see that people have data governance on structured data or on unstructured data. But as you know, today, when people are dealing with data problems, they need to deal with structured, semi-structured, and unstructured data together, sometimes for compliance reasons, sometimes for security reasons. So what we do is integrate them: IBM Storage has the capability to crawl through all the unstructured data.
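As a rough sketch of the two accelerations Vincent describes above, pulling remote objects into a local NVMe cache on first access and rewriting text data as columnar Parquet, here is a minimal illustration before the governance discussion continues. The cache path, the fetch callback, and the file names are assumptions for the example; this is not IBM's caching implementation.

```python
# Toy illustration of the two accelerations discussed above; not IBM's implementation.
import os
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

CACHE_DIR = "/nvme/cache"  # assumed local NVMe mount point

def cached_path(object_key: str, fetch_remote) -> str:
    """Return a local NVMe path for an object, fetching it from remote storage only once."""
    local_path = os.path.join(CACHE_DIR, object_key.replace("/", "_"))
    if not os.path.exists(local_path):           # cache miss: pull it across the wire once
        os.makedirs(CACHE_DIR, exist_ok=True)
        fetch_remote(object_key, local_path)     # e.g. an S3 GET in a real system
    return local_path                            # subsequent queries hit local NVMe

def text_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Rewrite a row-oriented text file as Parquet so engines can prune columns and row groups."""
    table = pacsv.read_csv(csv_path)             # parse the text format once
    pq.write_table(table, parquet_path, compression="zstd")
```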
Building on that crawling capability, we're able to detect data changes in the unstructured data: object storage adds a new object, file storage adds a new file. We detect it and build a metadata database, if you will, and we combine that with the structured data view. So with IBM technology you get a single pane of glass showing the distribution of all your data. Then you can apply policy to that data to perform particular functions, for example removing PII or hate speech from the raw data sources. The unique technology in the storage layer is being able to very efficiently crawl that data and build structured views, so that the higher-level management software, the IBM watsonx technology, can manage it.

Okay. So I'm able to share data in a governed manner, and I presume I can do that within my own organization. First of all, is that true? And then, if I want to share it within my ecosystem, are we at the point where you can actually do that as well?

The first answer to your question is yes, you are able to share across your own organization. And right now we are looking at how to standardize this metadata format. There is a bunch of activity happening in the industry to do that, and we definitely want to push it, to allow people to collaborate. Right now it seems like everybody has their own proprietary metadata management, but I think the next frontier will be everybody being able to share that metadata.

And that will lead to sort of open marketplaces. You know, Vincent, typically when I have these discussions with customers, they're frustrated because they have to make trade-offs. You could lower your cost, but it's going to have a performance impact. Or you might not get all the resiliency and the governance and the security, but you can lower your cost. Are you able to provide those three dimensions, cost, performance, and security with resiliency and governance, without trade-offs? And if so, how are you achieving that?

Excellent question, Dave. This is something that everybody wants, right? Everybody wants the highest performance, the most security, and the lowest cost possible. The way we accomplish this is that IBM storage technology leverages cloud-scale object storage, able to scale from the edge to on-prem to the cloud, to provide the most efficient persistent store. Let's say you have data in HDFS or some other storage; we're able to change the data format and put it on object storage. But once the data is everywhere, we need to cache the essential data, the data you actually need, in the caching layer, so you get the benefit of high-performance hardware. So what we do today is use Ceph object storage as the persistent storage and use IBM Storage Fusion technology to cache the right data. By the way, there is a lot of intricate innovation in caching the right data in the right place at the right time. And this caching is truly hands-off, if you will: we can detect what data has changed at the remote location, and we only update the data that has changed. This is called a global namespace.
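A very rough sketch of the cataloguing and change-detection idea described above: walk an unstructured store, record what is there and what changed in a small metadata database, and flag objects that a governance policy, such as PII removal, should touch. The table name, file extensions, and policy rule are invented for illustration; this is not the IBM metadata engine.

```python
# Toy sketch of change detection plus a metadata catalog; not the IBM implementation.
import os
import sqlite3

def crawl(root: str, catalog: sqlite3.Connection) -> list[str]:
    """Walk a directory tree, upsert (path, size, mtime) rows, and return new or changed paths."""
    catalog.execute(
        "CREATE TABLE IF NOT EXISTS objects (path TEXT PRIMARY KEY, size INT, mtime REAL)"
    )
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            row = catalog.execute(
                "SELECT size, mtime FROM objects WHERE path = ?", (path,)
            ).fetchone()
            if row is None or row != (st.st_size, st.st_mtime):
                changed.append(path)  # new or modified since the last crawl
                catalog.execute(
                    "INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
                    (path, st.st_size, st.st_mtime),
                )
    catalog.commit()
    return changed

def needs_pii_scrub(path: str) -> bool:
    """Hypothetical policy hook: route raw text documents to a PII-removal step."""
    return path.endswith((".txt", ".json", ".csv"))
```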
With that in place, we're able to promote the right data to the high-performance tier. So you basically have two tiers: a high-performance tier that handles your working set, and a persistent tier that preserves all your data. And IBM is actually developing a new technology for data you know you are not going to use for a long time, but that you need to preserve for a long period for compliance or other reasons, say seven or ten years, in the most sustainable way: we can push that data down to tape. Of course, when I mention tape, people often say, hey, we don't know how to use that technology. No worries: IBM provides an object interface to the tape, so people can talk to it through object storage. To you, it's just a different type of object storage, one that is more economical, more sustainable, and more power- and cost-efficient. But with all these different tiers, the trick is that the user doesn't have to deal with any of it. You just issue your queries, and we apply AI techniques, look at your access patterns, and make the right decisions to move the data to the right place at the right time.

You know, it's funny you mention tape. People may not realize it, but all the hyperscalers use tape. Tape is out there; it's certainly at least an order of magnitude less expensive, and it allows you to do things you can't necessarily do with disk. And it's interesting that you've got an object interface, because that essentially hides the back end, so why does anybody care? I wanted to ask you about the global namespace you talked about. Is that a highly consistent architecture, or is it eventual consistency?

We actually have two different types of consistency. We have immediate consistency and we have eventual consistency, because different data types call for different behavior; this is part of the intricacy of the technology. There is immediate consistency, where the data is committed in the right place right away, and there is eventual consistency, where the data is eventually synchronized. You choose based on the data type and the application. But the innovation, again, these are not new concepts; the innovation in IBM Fusion is that we are able to make the right decision for the customer. The last thing you want is to hand the customer a big, long manual where they have to pick and choose and never know where to start or which button to press. No, all of this is transparent to you. You just use it, and we make sure the data is in the right place at the right time.

Very interesting. And I think you alluded to this before: you can do this across clouds, is that right? It's a common experience across clouds?

Correct.

So across clouds, on-prem, hybrid, it doesn't matter which cloud, right? Even edge, eventually, maybe today. And an identical experience, is that right?

Correct. A consistent experience across them.

Is the deployment model a single global namespace, or is it an instantiation of your stack in different physical locations?

Obviously you don't want to form a single cluster across the world. You're going to have multiple clusters, just for the failure domain.
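Stepping back to the tier placement Vincent described a moment ago, here is a deliberately simplified sketch of that kind of decision: hot working-set data on the NVMe cache tier, warm data on object storage, and long-retention data pushed to tape behind an object interface. The thresholds and field names are invented; the real system derives placement from observed access patterns rather than fixed rules.

```python
# Simplified, hypothetical tier-placement rule; real placement is driven by access-pattern analysis.
from dataclasses import dataclass

@dataclass
class ObjectStats:
    days_since_last_access: int
    retention_years: int        # e.g. 7 or 10 for compliance data

def choose_tier(stats: ObjectStats) -> str:
    if stats.days_since_last_access <= 7:
        return "nvme-cache"                  # hot working set, highest performance
    if stats.retention_years >= 7 and stats.days_since_last_access > 180:
        return "tape-via-object-interface"   # cheapest, most power-efficient tier
    return "object-storage"                  # default persistent tier

print(choose_tier(ObjectStats(days_since_last_access=400, retention_years=10)))
```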
Those failure domains, those different clusters, then form a sort of higher-level cluster, so we are able to synchronize the namespace across them.

Interesting. I've got to ask you about generative AI; everybody's talking about it. And I want to put it into the context of data apps. It seems like this is the year of AI, the AI heard around the world, however you want to say it. And we've been thinking about this notion of real-time data applications. We use Uber as an example: basically taking a digital representation of your business, where the data coming in essentially informs your business as to its state, and people want to work in real time. They want to build this new breed of data apps and operate in real time. Again, the Uber example: riders, drivers, ETAs, destinations, pricing. The people, places, and things, the representation of your business. And of course AI plays a huge role in making that all coherent and serving the right data to the right person at the right time. Do you see that as a viable future? And if so, how do you see the storage requirements changing?

I absolutely agree with you; this is the future. And the challenge, listening to what you just described, is that you have all kinds of different apps with different requirements, and as you said, many of those apps need to share data between them. They need to be able to pass data from one app to the next. So how do people accomplish that today, without more advanced storage technology? Let's say app number one creates the data. Then, in order to give it to app number two, I make a replica of the data and send it over; app number two has to convert it into the format it prefers, operate on it, and then pass it along to app number three and app number four, and so on. So, number one, we create a lot of copies of the same data in different formats, and we keep replicating and moving data from one place to another. What we at IBM Storage believe is that we need a storage virtualization, I would even call it a data virtualization layer, if you will, that allows people to share data across different applications with different requirements and different access patterns. Right now, you can share the same data through an object interface or a file interface. But going forward, my vision is that storage will take on even more responsibility for this sort of ETL, the data transformation role, so that the storage's job in the future is to give the right form of data to the right people at the right time.

So the storage isn't just going to be a bit bucket, not that it is today, but it becomes more and more intelligent, and it's doing more to drive business value.

Absolutely. Look at today, right? People spend a lot of time moving data from storage number one to computer number one, into memory, then over to storage number two and computer number two, across networks. A lot of traffic, a lot of infrastructure, energy, and cost is spent on moving data around. I do think that in the future you won't need to do that. But we start with step number one: we need this storage layer to virtualize the data and manage it, so we have the ability to inject intelligence into the storage layer.
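As a small illustration of the sharing pattern Vincent contrasts with copy-and-convert pipelines, here is a toy model in which one stored copy of the data is exposed both as a file path and as an object-style GET. The class, paths, and keys are hypothetical; it only sketches the idea of multiple interfaces over the same bytes, not a real product API.

```python
# Toy model only: one copy of the data, two access interfaces. Not a real product API.
from pathlib import Path

class SharedDataLayer:
    """Expose the same stored bytes to one app as a POSIX path and to another as an object GET."""

    def __init__(self, root: str):
        self.root = Path(root)          # the single persistent copy lives under this root

    def as_file_path(self, key: str) -> Path:
        """App A mounts the data and reads it through a file path."""
        return self.root / key

    def get_object(self, key: str) -> bytes:
        """App B issues an object-style GET for the very same bytes; no replica is made."""
        return (self.root / key).read_bytes()

layer = SharedDataLayer("/data/lakehouse")      # assumed mount point
print(layer.as_file_path("orders/2024/part-0.parquet"))
```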
And that's where the differentiated value of IBM Storage comes in.

Vincent, you're one of the treasures inside of IBM. We love having you on theCUBE, and I hope we can have you back again soon.

I'd love to. This was a good conversation. It's great to talk to you.

All right, really our pleasure. Okay, keep it right there for more action from the IBM Storage Summit. This is Dave Vellante; we'll be right back.