Hi everyone, welcome back. We're at the Flink Forward user conference, sponsored by Data Artisans, the creators of Flink, here at the Kabuki Hotel in San Francisco. We're on the ground with Tom Kaitchuck, who is a senior consulting engineer at Dell EMC. Yes. You had a fairly exciting announcement to make this morning. Why don't you give us a little background on that?

Yes, so we're announcing Pravega, a new open-source project that Dell has been working on for the last year. We're opening the floodgates on May 10th, and it's going to act as a streaming storage system.

Okay, so help us square a couple of circles. We all learned over the last couple of years, as Kafka took large and medium-sized enterprises alike by storm, that it rethought the way to communicate data between applications. But as you were telling me, it still makes assumptions about the conventional hardware it runs on that might be sub-optimal.

Yeah, so the difference between what we're doing and what Kafka does fundamentally comes down to the model. Kafka is a messaging system, and its model is built around messages. Ours is a streaming system, and we operate fundamentally on a stream. When a client sends bytes over the wire, the server does not interpret them at all; the data is opaque. It's analogous to a Unix pipe or an HTTP socket: what goes over the wire isn't interpreted. That gives us the ability to channel that data, and we end up piping it into a long-term archival system, which gives us advantages in terms of storage. In a system like Kafka, where you need performance and high throughput, you basically run on machines that are built for IOPS. They're built to get data in and get data out, and that works, and it's fast. But what it doesn't give you is cheap long-term storage.
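The opaque byte-stream model being described here can be pictured in a few lines. This is a toy illustration only, not Pravega's actual API; the class and method names are assumptions for the sketch:

```python
# Toy sketch of an opaque byte-stream append log (illustrative only,
# not Pravega's real interface). The server never interprets the
# payload: it just appends raw bytes and returns the offset, much
# like writing into a Unix pipe that is also durably stored.

class ByteStream:
    def __init__(self):
        self._buf = bytearray()

    def append(self, data: bytes) -> int:
        """Append opaque bytes; return the offset where they start."""
        offset = len(self._buf)
        self._buf.extend(data)           # no parsing, no framing
        return offset

    def read(self, offset: int, length: int) -> bytes:
        """Read raw bytes back from any historical offset."""
        return bytes(self._buf[offset:offset + length])

s = ByteStream()
off = s.append(b'{"order": 42}')   # the server sees only bytes
assert s.read(off, 13) == b'{"order": 42}'
```

The point of the sketch is the contract: bytes in, offset out, with no schema or message framing imposed by the server.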
So usually what people do is run a separate system for cheap long-term storage, typically something like HDFS. You end up running a Kafka job that reads out of your Kafka topic and writes to HDFS. What we're doing is building a streaming system that directly takes the stream coming in from the user, holds it locally, and gives you the ability to stream off of it, connect to it, and listen to it in real time, with strong consistency. But at the same time, the ultimate place where the data is stored durably is your long-term storage, your HDFS. The advantage is that your storage becomes the cheap, dense storage you're used to configuring for HDFS, so you can configure very long retention, and you can use the same interface to go back to last year and stream forward. That means you don't end up in what I call an accidental lambda architecture, where you build something like a Flink cluster and say, oh, this is great: it connects to Kafka's streaming connector, we can stream data, we get real-time analytics, we can do all this nice stuff. But then, if you have a bug in your code and need to go back, you have to flip to a different connector and deploy a different job to backfill from a different storage system. We're aiming to solve that problem.

Okay, so let's frame that. A mainstream customer today who's been working with Hadoop would have their data lake, which is HDFS, their big old archive of data. Yes. And then they would be using Kafka either to ingest additional data into the data lake, or perhaps to extract it for an application that wants to process it with continuous processing or low latency.
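The tiered design described above, one read interface spanning both a fast local tier and cheap long-term storage, might be modeled like this. It's a toy sketch under assumed names, not Pravega's real implementation, with a plain list standing in for HDFS:

```python
# Toy model of tiered stream storage (illustrative, not Pravega's
# real design). Recent events live in a fast local tier; older
# events age out to a cheap cold tier (standing in for HDFS). A
# single read call spans both tiers, so replaying last year's data
# and tailing live data use the same code path, avoiding the
# "accidental lambda architecture" of separate batch and streaming
# connectors.

class TieredStream:
    def __init__(self, hot_capacity: int):
        self.hot = []                    # fast local tier
        self.cold = []                   # stand-in for HDFS
        self.hot_capacity = hot_capacity

    def append(self, event):
        self.hot.append(event)
        if len(self.hot) > self.hot_capacity:
            self.cold.append(self.hot.pop(0))   # age out to cold tier

    def read_from(self, position: int):
        """One interface over both tiers: history plus live tail."""
        return (self.cold + self.hot)[position:]
```

A reader that wants to backfill starts at position 0 and simply keeps reading forward into the live data, with no connector switch.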
Now your solution comes in where you want an emphasis on speed and scale. You're not reformatting the data to hit the disk in a format the file system understands; the data moves along in its in-memory format, if I'm understanding correctly. Yes. So there's a lot less translation going on, and partly because of that, and partly because you have higher-capacity storage, you don't have to spill to disk and exercise all the I/O you'd get from expensive disks. Right. So HDFS handles the big data, and the Dell EMC solution gives you much faster data than Kafka, which makes it a good citizen in a world where you want to build more and more continuous applications, where every last bit of latency is the enemy.

Yes. Our goal is very low append latency, and that's important because right now you can't reasonably do anything even analogous to streaming off of HDFS; the write latency is just too high. You end up calling write with a small bit of data and you're talking a hundred-plus milliseconds, and then when you turn around and read, your read performance will be very low if you've done lots of tiny appends. So we give you a system that lets you do lots of tiny appends very fast, at very low latency, but at the same time the data is ultimately stored in HDFS. You still get the nice bulk storage capacity of HDFS without incurring the penalty of all those tiny appends.

And just to be clear, those tiny appends: it's like your system is absorbing whatever volume or velocity is thrown at it. Yes. So it handles the back pressure, and rather than HDFS backing things up because of its high-latency write path, you absorb all of that, being optimized for speed and capacity, and then you put it back into the long-term store, HDFS.
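The write path being described, absorbing many tiny appends and turning them into a few large writes to cold storage, can be sketched as a small batching buffer. This is an illustration under assumed names, not Pravega's implementation:

```python
# Toy sketch of aggregating tiny low-latency appends into large
# writes to cold storage (illustrative, not Pravega's real code).
# Each tiny append is absorbed immediately in memory on the fast
# path, while the slow path issues one big sequential write to the
# cold tier (standing in for HDFS) once enough bytes accumulate.

class BatchingWriter:
    def __init__(self, flush_bytes: int):
        self.flush_bytes = flush_bytes
        self.buffer = bytearray()
        self.cold_blocks = []            # each entry is one big write

    def append(self, data: bytes):
        """Fast path: absorb a tiny append in memory."""
        self.buffer.extend(data)
        if len(self.buffer) >= self.flush_bytes:
            self._flush()

    def _flush(self):
        """Slow path: one large sequential write to cold storage."""
        self.cold_blocks.append(bytes(self.buffer))
        self.buffer.clear()
```

The client sees per-append latency bounded by the in-memory fast path, while HDFS only ever sees a small number of large writes.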
Yes, we can aggregate all these tiny writes into one or two big writes and put them in.

So tell us about some of the use cases you're working on with design partners. Right. The big one we're working on with Data Artisans is getting exactly-once semantics in Flink jobs that are derived from one another. For example, say you have a job that takes in an order, processes it, and generates some derivative data. Today, if you want exactly-once semantics on a job that runs on that derivative data, it has to be co-located and run with the first job, and that's problematic for a number of reasons, mainly because in a lot of companies you don't want some secondary job to impact the primary one. So you want something in between that can operate as a buffer, but right now there's no way to do that in a streaming pipeline without giving up exactly-once semantics, and exactly-once semantics is a really big deal for a lot of Flink applications. What we let you do is have one Flink job that produces some output into Pravega as a sink, and then Pravega turns around and is a source for another Flink job, and you still have exactly-once semantics end to end.

Okay, so just as Kafka acted as the source and sink, consumer and producer, through a hub, once data was handed off to another system it lost that exactly-once guarantee. Yes. And as we said, it wasn't necessarily optimized for throughput and capacity, so that's how you solve that problem. Okay, so if you were to pick some common applications that have been served by Kafka and Flink, are there certain characteristics that would be most amenable to the Dell EMC solution?
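One way to picture chaining two jobs through a durable buffer without losing exactly-once semantics is deduplication by event id on retry. This is a deliberately simplified stand-in: Pravega and Flink actually achieve exactly-once with transactional writes and checkpoints, but the toy below shows why an intermediate buffer with idempotent writes keeps the downstream job duplicate-free:

```python
# Toy model of a durable buffer between two jobs that preserves
# exactly-once delivery (illustrative; the real mechanism in
# Pravega/Flink is transactional writes plus checkpoints). The
# buffer deduplicates by event id, so a producer that crashes and
# retries a batch never causes the downstream job to see duplicates.

class ExactlyOnceBuffer:
    def __init__(self):
        self.events = []
        self.seen_ids = set()

    def write(self, event_id, payload):
        """Idempotent write: a retried event id is silently dropped."""
        if event_id not in self.seen_ids:
            self.seen_ids.add(event_id)
            self.events.append(payload)

buf = ExactlyOnceBuffer()
# Job 1 writes two events, crashes, then retries the whole batch:
for eid, order in [(1, "order-a"), (2, "order-b")]:
    buf.write(eid, order)
for eid, order in [(1, "order-a"), (2, "order-b"), (3, "order-c")]:
    buf.write(eid, order)      # retry; ids 1 and 2 are deduplicated
# Job 2, reading from the buffer, sees each event exactly once:
assert buf.events == ["order-a", "order-b", "order-c"]
```

Because the buffer, not the downstream job, enforces the guarantee, the two jobs no longer need to be co-located.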
Anything that requires strong consistency. The real difference we have is that we have a strongly consistent system underneath. We don't just have one API that deals with events and so on; we actually have a low-level primitive, and we're building a lot of different APIs on top of it. Let me give you an example. We have an API we call a state synchronizer: an object that you can hold in memory across a number of machines and perform updates on, with the guarantee that every process performing an update is operating on the latest version of that object. That object is coordinated across a fleet, and everyone sees the same sequence of updates and the same object at any given time. That's a real advantage anywhere you're trying to do something that requires strong consistency.

So you can build those sorts of applications, and you can also do things that require transactional semantics. One thing we allow is writing data to our output transactionally: you can have one Pravega stream and coordinate a transaction, potentially spanning different areas of the key space that actually land on multiple Pravega hosts, with atomic consistency, where you call commit and all of the writes across all of them go in simultaneously. That's a big deal for a lot of applications. And you can combine these two primitives, a state object and a transaction object, to interlink transactionality without an external system. So you could, for example, say: I have a Flink sink that's going to have a couple of different outputs, but one of them is a SQL database, and I want this output to go to Pravega if and only if my transaction to SQL commits.
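The state synchronizer idea, every update applied against the latest version of a shared object, can be sketched with optimistic concurrency. The class below is a toy model, not Pravega's actual StateSynchronizer API, which is built on its stream primitive:

```python
# Toy sketch of a state synchronizer via optimistic concurrency
# (illustrative; Pravega's real StateSynchronizer is layered on its
# stream primitive). Every update names the version it was computed
# against; a stale update is rejected, so every process applies the
# same sequence of updates to the same object.

class StateSynchronizer:
    def __init__(self, initial):
        self.state = initial
        self.version = 0

    def update(self, expected_version, fn):
        """Apply fn(state) only if computed against the latest version."""
        if expected_version != self.version:
            return False                 # caller must re-read and retry
        self.state = fn(self.state)
        self.version += 1
        return True

sync = StateSynchronizer({"count": 0})
v = sync.version
assert sync.update(v, lambda s: {"count": s["count"] + 1})
assert not sync.update(v, lambda s: {"count": s["count"] + 1})  # stale
assert sync.state == {"count": 1}
```

A process whose update is rejected re-reads the latest state and retries, which is what guarantees everyone sees the same sequence of updates.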
Oh, it sounds like you get distributed transactions as a freebie. Yes. That's very interesting; that's the kind of handoff you'd expect from a single-vendor solution. Very impressive. All right, Tom, on that note, we're going to have to cut it off, because we are ending our coverage at Flink Forward, the Data Artisans user conference, the first one held in the US. We are at the Kabuki Hotel in San Francisco. I'm George Gilbert, and we're signing off for this afternoon. Thanks for watching.