And here we go. Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of DataVersity. We'd like to thank you for joining today's DataVersity webinar, "Why You Need End-to-End Data Quality to Build Trust in Kafka," sponsored by Infogix. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom middle of your screen. For questions, we will be collecting them via the Q&A panel in the bottom right-hand corner of your screen. If you'd like to tweet, we encourage you to share highlights or questions using the hashtag #DataVersity. And if you'd like to continue the conversation after the webinar, you can go to community.dataversity.net. As always, we will send a follow-up email within two business days containing links to the slides and the recording of this session, along with any additional information requested during the webinar.

Now let me introduce our speaker for today, Jeff Brown. Jeff is a director in Infogix's product management group, responsible for delivering customer-driven solutions across Infogix's Data360 platform. He holds an MBA from DePaul University and a Bachelor of Science in Engineering from Michigan State University. And with that, I will give the floor to Jeff to get the webinar started. Jeff, hello and welcome.

Hi, everyone. Thanks, Shannon, for the introduction. Let me get started with my slides. As Shannon mentioned, my name is Jeff Brown, and I'm the director of data quality and analytics here at Infogix. Today we're going to talk about Kafka and its impact on organizations, what we're seeing in the market, and how we help customers across different industries.

I want to start with a quick stat. Forbes recently noted that 90% of the world's data was created within the last two years. As you can imagine, that is quite a bit of data being generated, and a lot of it comes from streaming sources such as IoT and other devices that stream data. Even in the corporate world, far more data is being moved around and transferred from system to system, and inside each of your organizations I'm sure you're seeing that the sheer generation of data is starting to overwhelm the business. Some of that data is also coming through a relatively new type of messaging platform, Apache Kafka.

If we take a quick look at Kafka at a glance, with numbers from Confluent, which runs many of these studies and offers Kafka software as well: 60% of the Fortune 100 are using Apache Kafka, and over 100,000 organizations globally are using it. The adoption curve has been fairly steep recently, and some of these numbers may have gone up in the past few months. If we look at a 2018 Apache Kafka report from Confluent, 90% of Kafka projects are considered mission critical within the organization. And if we look at why organizations are going to Apache Kafka, 62% of deployments are actually replacing existing technology.
We're seeing that with our large customers as well: they're moving from older file-based systems to this new event-driven architecture to get more data, higher volumes, and higher throughput. So why are organizations moving to a streaming-based architecture, even beyond Kafka? Today we're going to focus on Kafka, but in general we have many customers, and we see plenty of research, showing that streaming data is becoming more and more popular in enterprises.

So what is Apache Kafka? Apache Kafka is a real-time streaming message system built around the concept of publish and subscribe. Similar messaging systems exist that are queue-based, if you think of the MQs of the world. In a publish-and-subscribe system like Kafka, the publishers, the producers, are the creators of the data, the messages, and they feed those messages into different topics; consumers on the other end subscribe to those topics and receive the messages from them. It's a little different from MQ, where you have a queue being sent messages and a subscriber taking messages off that queue. In the Kafka world, messages are created and made available to one or many consumers, which can each read the messages off a topic, and the messages don't disappear; they are retained for a configured amount of time. Messages are also stored in topics across many partitions, which provides a lot of the redundancy enterprises are looking for: multiple copies of the data, so that if anything were to happen to a particular partition, there is a failover.

If you look at it visually, as a general data pipeline, the producers are on the left-hand side. As I mentioned, these could be devices, machine log data, application logs, Internet of Things sources, general applications, third-party vendors, or even files to a certain extent, all generating messages that are put onto a topic. The messages are published and brokered by the Kafka platform, and they are then consumed by the consumers on the right-hand side through a subscription-based model. The flow of the chart goes from left to right; it's a very generalized pipeline, but the concepts are the same. When we talk about producers, that's anyone creating messages and posting them to a topic; the consumers subscribe to those topics and receive those messages in real time.

Some of the advantages we hear customers talk about are listed here. For the majority of customers, it's really about real-time data availability. Apache Kafka's event-driven architecture lets organizations reduce the lag in getting data from a producer to a consumer, so that data is available in real time, or near real time, as soon as it happens, which allows consumers to respond faster to critical data-driven decisions.
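To make the publish/subscribe flow described above concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and message fields are illustrative assumptions, not part of the webinar.

```python
# Minimal publish/subscribe sketch using the kafka-python client.
# Broker address, topic name, and fields are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed local broker
TOPIC = "claims"            # hypothetical topic name

# Producer: publishes messages onto a topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"claim_id": "C-1001", "amount": 250.00})
producer.flush()

# Consumer: subscribes to the topic and receives messages.
# Messages stay in the topic for the configured retention period,
# so other consumer groups can read the same data independently.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="downstream-app",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```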
There's also the concept of Apache Kafka acting as centralized access to data. If you think of that hub in the middle of the chart we just saw, the Kafka platform gives central access to the data in one location. Instead of building out multiple integration points, it reduces the number of integration points: IT, which handles most of the Kafka platform management on the technical side, can give consumers centralized access to the data, and from an IT perspective that is a real reduction in integrations. If you have thousands of applications and thousands of downstream systems looking for that data, you don't have to build an integration point for each one; Kafka acts as that centralized data platform to a certain extent.

It is also built to scale to massive amounts of data, both high volumes and high throughput. Kafka is built to handle very large amounts of data where some other message-based systems would potentially drown in the volume being handled. It also acts as a type of data storage layer. One of the advantages of Kafka is that it can retain data, say messages for a week, or a certain number or size of messages, so if you subscribe to a topic you can potentially go back in time, prior to when you subscribed, and retrieve earlier messages. And, bottom right, there is a certain fault tolerance built into the system, which provides additional protection against data being removed, deleted, or unavailable across the platform.

Some of the key drivers for moving to Kafka include the ability to make faster business decisions: the faster the data is put into the decision-maker's hands, the faster the decision can be made. When we ask customers why they're moving and what the technical driver is, part of the answer is "we can no longer live in this batch-based world; we need access to more real-time data." Kafka also creates a unified data hub so the business can consume data from a centralized point, which lets downstream applications go to a single location for the data that is critical to their decision-making and their downstream systems. It also gives better data access to data scientists and analytics teams, in conjunction with that real-time data. Depending on what your data scientists are looking to consume, they now have access to real-time data: as soon as it's generated and posted to a topic, they can consume it, build it into their models, and train or learn from it. The same goes for the analytics teams; if they're making analytic decisions, they have faster data, at higher volumes, available to them.
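Because messages are retained rather than consumed destructively, a consumer can also replay history, the "go back in time" advantage mentioned above. A rough sketch with kafka-python, again assuming a local broker and a hypothetical topic and partition:

```python
# Sketch: replaying retained messages from the start of a topic.
# Kafka keeps messages for a configured retention period, so a new
# consumer can read data that was published before it subscribed.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")  # assumed broker
partition = TopicPartition("claims", 0)                        # hypothetical topic/partition
consumer.assign([partition])
consumer.seek_to_beginning(partition)   # rewind to the oldest retained offset

for record in consumer:
    print(record.offset, record.value)
```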
And lastly, for a lot of the companies we speak to, Kafka is part of their digital transformation strategy. By that I mean they're trying to move a lot of their on-premises systems to the cloud and to a more event-driven architecture. These are large companies, Fortune 500 companies, that in the past may not have moved as fast in adopting newer technologies, but we're seeing it more and more across our customer base: as they move to Apache Kafka, or have it in their sights within the next 18 months, it is a key part of their digital transformation strategy, and a lot of that has to do with the real-time availability of data and the redundancy I mentioned before.

However, it's not always rainbows and sunshine for the organizations adopting Kafka. As we talk to them and understand their challenges, we've collected some of them here. What are they saying? A lot of the time, since it's such a new architecture and a new piece of their business flow, audit is not going to let them adopt it blindly: "Audit will not let us move forward with our Kafka platform without being able to validate the data." So sometimes it's audit-driven. They're saying everything looks great architecturally, we love the concept of having data in real time, but there is no way audit will let us move this volume of data from producers to consumers without some assurance around it. Audit needs to know that what was created is what was sent, and so on. They need validation of the data in motion: Kafka's data, flowing from producers to consumers, really needs to be validated before it is posted to downstream systems, and auditability of the data and the process is still a very big focus for the companies we're speaking with.

We also have several large companies across industries telling us, "We're moving all system-to-system communication from file-based transfer to Kafka messages." What we're hearing is that they are no longer generating or extracting data as files in a batch-based manner; they are moving strictly to Kafka. That is a new means of digital communication between systems and a true recognition of real-time data access. It may be driven by the business saying we need the data faster, or it may be driven by IT, but all in all they're saying: we're no longer creating files and moving data that way; we're moving to an event-driven architecture with Kafka as our centralized data hub so we have access to this data in real time.

And lastly, we're also hearing some hesitation. This one is from a large bank: "We don't trust the stability of our Kafka platform enough to expand its usage."
So again, there are reasons why they're moving and pushing toward it, but there is also hesitation: we've got our Kafka platform in place, we're not in production yet, we're ramping up to production, but we're really not moving forward because we just don't trust it yet. We don't have validation on the data, and we don't have trust in the overall Kafka platform. It's not that they don't trust Kafka itself; it's that it's so new that they don't have their arms completely wrapped around it, and they can't get transparency and clarity into the data inside the different topics and streams being generated and pushed to these critical business systems. They really do require insight into their operations.

What we're also seeing from the companies we work with is that it may be a new technology, but it comes with the same challenges they've seen before. They're running into the same data quality issues on their Kafka platform. Take the old adage: garbage in, garbage out. Just because data is being pushed through faster and at higher volume does not mean the data quality is any better than it was before. That's a key concept that resonates with the enterprises we work with: it is potentially an even riskier endeavor, because it's the same data coming from the same producers, except it is now being pushed into decision-makers' hands even faster. Nothing is being done to the quality; it's simply delivered sooner. It's like turning up the speed of the conveyor belt: data quality stays the same as it was coming in, and decisions based on it happen faster.

So here is a list of questions organizations should be asking to get ahead of these challenges. Do we know if all the transactions that were supposed to be sent were actually sent, and how do we know? Do we know when, or if, duplicate data is being sent through the different topics and streams, and how would we know if it had been? Do we know whether the data across transactions and messages, whether those are policies, claims, or financial trades, was aggregated and transformed correctly? How do we know?
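The duplicate-data question above can be illustrated with a small sketch. It assumes each message carries a unique business key; the field name "claim_id" is hypothetical.

```python
# Sketch: detecting duplicate messages by business key.
# Assumes each message has a unique identifier such as a claim or
# trade ID; the field name "claim_id" is illustrative.
def find_duplicates(messages, key_field="claim_id"):
    seen, duplicates = set(), []
    for msg in messages:
        key = msg.get(key_field)
        if key in seen:
            duplicates.append(msg)      # same key seen more than once
        else:
            seen.add(key)
    return duplicates

batch = [{"claim_id": "C-1"}, {"claim_id": "C-2"}, {"claim_id": "C-1"}]
print(find_duplicates(batch))           # -> [{'claim_id': 'C-1'}]
```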
So if audit says, "Prove to me that this data is being aggregated the way it's supposed to be," they have trouble answering honestly, because they don't have clarity into the data being pushed from system to system. Again: how do we know that all the transactions that were supposed to be sent actually arrived? How do we know they were sent in a timely manner, or even in the correct order? As we talk to customers, they admit they don't actually know that everything is being pushed through. Because Kafka is a fairly new platform for them, the work so far has mostly been about setting up the infrastructure: making sure it can handle the volume, making sure all the topics are being subscribed to appropriately, making sure it can handle the message sizes. Customers tell us about messages ranging from three kilobytes up to three megabytes, so it's a broad range. Some of it is purely operational: can we get the messages flowing, can we get Kafka up and running? But once the business starts to rely on the real-time data coming over this new platform, these questions will arise. I need to know you're not sending me multiple copies of the same claim in one message, or in the same topic, or across topics. I need to know I'm not doubling up the number of transactions in my daily trades. And how am I notified whether the right number of messages came over in a given day? There's a threshold tolerance to build: how do I know the number of messages sent wasn't above an upper limit or below a lower limit within my organization's tolerance? So there's the whole concept of monitoring the Kafka streams, and then there's the content of the messages themselves; there's an operational view and a context or business view of the questions these customers should be asking.

And what is the impact of these challenges? Take the negative of all those questions: a duplicate message was sent, a message arrived late or incomplete, or it was transformed and aggregated incorrectly. From an IT and operations perspective, they're unable to monitor data volumes for anomalies. Say 10,000 messages were supposed to be sent on a topic in a day and that drops to 100, while the consumer was expecting its usual volume; they need to be able to respond to that anomaly and do a root cause analysis, because maybe something is wrong technically, maybe the data isn't arriving on time, or maybe the source isn't posting at all. There's also the question of how IT predicts the data volume needed for retention: if they can't monitor the streams and accurately predict how much data will be stored and retained inside their Kafka platform, that can cause processing errors and impact downstream systems when data is not delivered in a timely manner.
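A minimal sketch of the volume-threshold idea above: compare a topic's observed count against an expected value with a tolerance band. The expected count and tolerance are illustrative; in practice they would come from historical baselines or business rules.

```python
# Sketch: monitoring a topic's daily message count against an
# expected tolerance band. Expected count and tolerance are
# illustrative assumptions.
def check_volume(topic, actual_count, expected=10_000, tolerance=0.20):
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    if actual_count < lower:
        return f"ALERT: {topic} received {actual_count} messages, below lower limit {lower:.0f}"
    if actual_count > upper:
        return f"ALERT: {topic} received {actual_count} messages, above upper limit {upper:.0f}"
    return f"OK: {topic} received {actual_count} messages"

print(check_volume("daily-trades", 100))    # far below expectation -> alert
print(check_volume("daily-trades", 9_800))  # within tolerance -> OK
```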
Having all of these types of monitoring in place also allows IT to identify underlying infrastructure issues, which could even be hardware problems. On the business side, for those consuming the data, incorrect data can mean making the wrong business decisions, and, as I mentioned, you may now be making those decisions faster because the data is delivered faster. That can ultimately lead to customer loss, harm to reputation, revenue loss, regulatory fines, and so on, depending on the type of data that is falling through the cracks, not delivered on time, or delivered incorrectly. All of this can lead to soured customer relationships and real dollars in regulatory and compliance fines. And if you take a step back, what it really means is a reduced overall trust factor: the business doesn't trust the data, whether that data is going into a regulatory report or any other kind of reporting, and that leads to reduced trust and reduced adoption going forward.

So how do you build data trust within your organization, and especially, how do you build trust in Kafka and provide end-to-end data quality? When I think of the data pipeline, I go back to some of my manufacturing experience. Look at the data pipeline in any business, in any industry, and compare it to the quality checks on a car being built: you don't put all the quality checks at the very end. As the parts and raw materials come together, you're doing validation, monitoring, and quality assurance on those parts. The same goes for the assembly of those parts: they're not yet ready to be packaged and shipped, but once they are, you have QA on that as well. Think of the data pipeline the same way. You've got raw data on the left, source data coming in, third-party data; all of it needs to be monitored and validated. You can't wait until the very end and only validate at the data warehouse. Whether Kafka sits in the middle here or at the bookends of the process, you need validation and data quality throughout. You can't say "I'm doing data quality on my data warehouse because I'm cleansing and validating at the very end"; by that time it's already too late. You don't know who's consuming the data, you can't track it back to the source system, and by the time it gets there it may have been aggregated and transformed into something unrecognizable from what it looked like at the source.

So from an Infogix perspective, when we talk with an enterprise or a data strategy team, we look at validation and data quality end to end, all the way from raw source data to the finished product, whether that's a data warehouse, a data mart, a data lake, or your MDM system. And that's the same way enterprises need to look at it.
You can't just rely on saying, "I'll do some QA on the car as it rolls off the end of the line." That doesn't make sense, because you end up with problems you can no longer do anything about. To stay with the car analogy, if the problem is in the engine, you have to take it back out, and it's a lot easier to fix it up front than at the end. So when we talk about end to end, this is what we're referring to from an Infogix perspective. Taking that producer-to-consumer view, we provide validation, data quality, reconciliation, and balancing holistically. You need to be able to get inline to the message before it's posted to the consumer and validate it there, but you also need to handle data coming from source systems that are not Kafka or streaming messages. So you need data ingestion from all of those different source types along with the Kafka messages: you can read messages from Kafka, you can read data that isn't a Kafka message, aggregate it, transform it, and then emit a Kafka message for downstream systems to consume. We insert ourselves into the key process flow so you can have assurance that the content of the messages is validated through business logic and a rules engine, and also, from a threshold and monitoring perspective, that IT gets visualization and monitoring of the counts and amounts on the different topics. It's all interrelated.

As you stand up a new data quality project, you can no longer just point at a data lake or a data warehouse and say that's where my data quality effort is focused; by that time it's too late, the new car is already rolling off the line. You need to provide validation up front, as early and as often as possible, whether that's on producing systems or source systems. Around Kafka especially, you now have higher volumes, higher throughput, a new technology, and business consumers looking to make critical decisions faster; you want to give them assurance that you're providing the highest quality data, not only inside Kafka but before and after it as well. Looking at it before, within, and after Kafka is what we aim to provide, and that's what has resonated with customers and been successful.

We take a multifaceted approach to building trust from producer to consumer. It's not only data quality: you need the conformance checks, uniqueness checks, completeness checks, all the key dimensions. You also need reconciliation. If we look at Kafka as a data-in-motion, data-movement mechanism, you need to be able to say that this number of transactions and messages was created by the producing system and this number was received by the consuming systems, and reconcile the counts and amounts, even down to a more detailed, transactional level.
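A rough sketch of that counts-and-amounts reconciliation, assuming simple dictionaries on each side; record structure and field names are illustrative, not a description of any particular product.

```python
# Sketch: reconciling counts and amounts between what a producing
# system says it sent and what the consuming side actually received.
def reconcile(produced, consumed, amount_field="amount"):
    produced_count, consumed_count = len(produced), len(consumed)
    produced_total = sum(r[amount_field] for r in produced)
    consumed_total = sum(r[amount_field] for r in consumed)
    return {
        "count_match":   produced_count == consumed_count,
        "amount_match":  abs(produced_total - consumed_total) < 0.01,
        "missing_count": produced_count - consumed_count,
        "amount_gap":    round(produced_total - consumed_total, 2),
    }

sent     = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.0}]
received = [{"id": 1, "amount": 100.0}]
print(reconcile(sent, received))
# {'count_match': False, 'amount_match': False, 'missing_count': 1, 'amount_gap': 250.0}
```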
It needs to be reconciled and it needs to balance throughout, so that you can build trust within the business, and so the business can expand usage and trust its decisions. And lastly, there is the integrity of the messages and data being sent from left to right, to go back to that diagram: did any messages fall off along the way? Was everything that was supposed to be delivered in a timely manner actually delivered? It comes back to knowing that whatever data leaves a producer, or leaves those producing systems, ends up in my consuming systems, and even further downstream, with the highest integrity and the highest data quality.

So how do you provide a 360-degree standard for data and trust? Again, take a step back and look at it from producer to consumer: you set up the data flow, the data pipeline, for the data moving from producer to consumer, but you have to look at it from top to bottom as well. You not only need to identify the data that's bad; you need a rules engine flexible enough for business users to create rules and translate their business logic into real data quality rules. You also need mechanisms to monitor that data, to monitor the streams continuously. It's not one and done; you don't point at a data source once and say "give me the data quality." You're putting recurring monitoring on the data and messages flowing through your data pipeline and your Kafka platform. So: I've built the rules, I've migrated my business logic into usable data quality rules, and I've set up my monitoring, how often these checks run, which check runs first, and if one fails, what happens next and which key stakeholders get alerted.

Then, once a bad message has been identified, what do you do with it? Do you have a mechanism to remediate that data, to quarantine it, and to notify the key stakeholders that a particular topic may be producing bad data? Can you split it into a good data stream and a bad data stream, a good topic and a bad topic? Do you have a mechanism to hold the key stakeholders, the producers of that data, accountable, so you can go back and say, "You are the highest offender on the quality we're monitoring in our Kafka platform today, and here's the proof"? The bad data is captured in the system, assigned to an internal queue, and worked through a workflow; you can assign it to users and groups to review or even send downstream. And beyond holding the raw bad data and having a mechanism to remediate and track those remediations, you need to communicate it appropriately through visualizations, dashboards, KPIs, and metrics: bar charts, line charts, all the graphics you can imagine, within a single system, so that every role is addressed, from the data quality analyst or data analyst, to a mid-tier manager, all the way up to the executive level.
That way you can roll up the KPIs and metrics, the bad data metrics and even the good data metrics, and communicate them appropriately throughout your organization. That is our 360-degree standard for providing data trust, and Infogix provides it through the Data360 platform. We take a three-pronged approach. From a data governance perspective, you not only need to know what data you have; you need to know who owns it, where it is, and when it's changing, so you need to capture that metadata, catalog it appropriately, assign owners, and manage changes through workflow. Governance is about knowing your data; data quality is about trusting your data: you need a data quality baseline, you need to get down to the transactional level, and you need to monitor the data flowing through, which is most of what we're talking about today, building data quality and trust. And finally, getting your hands on the data, harnessing it and transforming it further, is our data analytics approach.

If we look at data quality a little deeper for streaming data, what does that mean for the companies we work with across the end-to-end pipeline? What functionality and characteristics help provide data quality on streaming data? You need to be able to validate both real-time and batch data, streaming data as well as files, databases, and the more traditional sources, within a required time window. If we're talking Kafka, we're talking fast and high volume, so you need to be able to say "I need to capture and check every single message in real time," or, if I'm monitoring a topic, "how many messages came through in a certain time window." And in many cases I may not need a real-time decision or validation on the data, so I can work in more of a micro-batch fashion, every few seconds, every few minutes, or every hour.

You also need the balancing and reconciliation I touched on before, from source all the way to target, with Kafka sitting in the middle. If you have data sources being converted into Kafka messages, you may want to go beyond the Kafka message and check the data before it becomes a Kafka message, back at the source systems. There are a lot of tools out there that transform files and database tables into Kafka messages; what if you want to confirm that whatever was in my table ends up at the consumer, with validation logic at every step of that process? You also need transformation and aggregation logic, a rules engine that can transform and aggregate both streaming and non-streaming data. And I'm looking to enrich streaming data as well: if you have topics generating, say, claim information, policy information, or a financial trade, I'd like to enrich that message, whether from a non-streaming source or from another streaming source, another Kafka message.
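To make the micro-batch idea above concrete, here is a rough sketch that collects messages from a topic for a fixed window and then hands the batch to whatever checks you run. It assumes the kafka-python client, a local broker, and a hypothetical topic; the window length is arbitrary.

```python
# Sketch: validating in micro-batches rather than per message.
# Messages are collected for a short window, then checked as a group.
import json
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "claims",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",         # assumed broker
    group_id="dq-microbatch",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def collect_window(seconds=60):
    """Poll the topic for a fixed window and return the batch of messages."""
    batch, deadline = [], time.time() + seconds
    while time.time() < deadline:
        for records in consumer.poll(timeout_ms=1000).values():
            batch.extend(r.value for r in records)
    return batch

window = collect_window(seconds=60)
print(f"{len(window)} messages in this window")   # feed into batch-level checks
```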
So it's all these different types of operational checks and validations on streaming data, and on non-streaming data, plus the ability to visualize them. I need to be able to identify and manage exceptions. I need upper and lower control limits; I need to know if message volumes are degrading abnormally, so I can put in key statistical controls, for example, "if four of the last five data points are on a negative slope, notify this key stakeholder," along the lines of the Western Electric-style statistical controls that are often implemented. I also want machine learning on this data. There's a large volume being touched, monitored, and validated within the Kafka streams, so I want machine learning around it to identify anomalies and outliers, to learn from, and to build models on, and I want that in the same tool that's doing the validation on the messages; it's redundant to have separate systems for machine learning and for data quality validation. If you already have the message and the data in your grasp, you might as well apply machine learning to it and build better insights for your enterprise.

In terms of Data360 streaming functionality, if we dig into the Infogix product, there is a set of basic capabilities that turn out to be critical; they can be used simply or in much more complex ways, and we'll touch on that later. You need to be able to ingest streaming data, output to data streams, and convert streaming data to batch data. I need to be able to run SQL on streaming data, because I may have SQL statements from existing checks that I want to carry over and apply to my Kafka messages. I need to be able to join streaming messages with both other streaming messages and batch data. And I need deduplication of messages, which we touched on earlier: one of the key questions companies ask is how they know whether data has been sent multiple times or whether they're receiving duplicates.

Taking a step back to use cases, we're hearing from many customers that they need both streaming, inline validations and data checks that happen in batch, so when you look at end-to-end data quality on a Kafka pipeline, we have customers and use cases for both. At a high level, you need the ability to read messages, apply some SQL or rules to the streaming message, and then push out the good data, the messages that validated, while identifying the bad data so you can do something with the failed messages: send them to a workflow, remediate them, or route them internally. Or you take those bad messages, don't post them to the consuming system at all, and send them to a topic that exists only to hold the bad messages or bad data, so that it isn't diluting the mainstream topics that are being used to make business decisions.
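A minimal sketch of that inline routing pattern: read messages, apply a placeholder business rule, and either split records into "good" and "quarantine" topics or tag everything with a status attribute. Broker address, topic names, and the validate() rule are assumptions for illustration only.

```python
# Sketch: routing validated messages to a "good" topic and failed
# messages to a quarantine topic, or tagging each with a status field.
import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # assumed broker
producer = KafkaProducer(bootstrap_servers=BROKER,
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
consumer = KafkaConsumer("claims-raw", bootstrap_servers=BROKER, group_id="dq-router",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")))

def validate(message):
    # Placeholder business rule: a claim must have a non-negative amount.
    return message.get("amount", -1) >= 0

for record in consumer:
    msg = record.value
    status = "passed" if validate(msg) else "failed"
    if status == "passed":
        producer.send("claims-validated", msg)    # good data stream
    else:
        producer.send("claims-quarantine", msg)   # bad data held for remediation
    # Alternative pattern: publish everything to one topic with a status
    # attribute so downstream systems can decide what to do with failures.
    producer.send("claims-scored", {**msg, "dq_status": status})
```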
That is more of the inline type of check. If we're looking at more traditional data quality rules, or at reconciliation rules where you have a static reference data set, or at pre-built data quality rules inside the tool, you may need to read multiple messages, aggregate them, and batch them together to profile what's coming through, and you can do that as well. In that example, I read small groups of messages, convert them into a batch, run data quality on the batch, and again push out the passed records and push out the failed ones; or I can push everything to a single topic containing both passed and failed messages, with an additional status attribute on each message that says "passed" or "failed," and the downstream systems can do their own handling based on it. You need that flexibility built into the tool, so you can create these new kinds of topics, because you will face these use cases sooner or later. Your validation will not be fully real-time across every system; you will have some batch-style checks as well.

Then there is the more complex type of rule, where streaming data is coming in, I need to validate both streaming and non-streaming data sources, and I need to output the result in a single, streamlined pass. I may need to join non-streaming data, such as a database, with streaming data sources and then output to an external system or as a Kafka message. So whether I'm capturing Kafka messages from a topic or capturing files and databases, the more traditional kinds of data, I need all of that within a single tool, with all of the options at my fingertips. You could have a large Kafka project running today whose goal is to remove files from data transfer entirely, but in reality, if a third-party vendor is providing you a file, or you have a CRM that doesn't produce Kafka messages, you will still be faced with use cases like "I need this reference data and it isn't streaming," or "I need this streaming data enriched from a reference data set." You will run into these cases within the overall scope of your Kafka adoption, and, to put in a plug, Infogix provides this as a single solution.

Looking at data quality rules specifically, when you're building out these rules we have over a hundred different predefined rules that run against a single attribute, everything from conformity to completeness to uniqueness and so on; the major data quality dimensions are covered so you can implement them easily.
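To illustrate what attribute-level rules across those dimensions look like in practice, here is a small sketch. The attribute names, the format pattern, and the rules themselves are illustrative assumptions, not a description of any vendor's actual rule library.

```python
# Sketch: attribute-level data quality checks across a few common
# dimensions (completeness, conformity, uniqueness).
import re

def completeness(value):
    # The value must be present and non-empty.
    return value is not None and value != ""

def conformity(value, pattern=r"^C-\d+$"):
    # The value must match an expected format (hypothetical claim-ID pattern).
    return isinstance(value, str) and re.match(pattern, value) is not None

def uniqueness(values):
    # No attribute value may repeat within the batch.
    return len(values) == len(set(values))

records = [{"claim_id": "C-101"}, {"claim_id": "C-102"}, {"claim_id": "bad id"}]
ids = [r["claim_id"] for r in records]

print(all(completeness(v) for v in ids))   # True  - every value is present
print(all(conformity(v) for v in ids))     # False - "bad id" fails the format check
print(uniqueness(ids))                     # True  - no duplicates in this batch
```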
We have built-in validation in terms of the rules, we have visualization inside the system as a single mechanism to build out dashboards, and on top of that we have workflow and routing to remediate and resolve issues, so you can actually do something with that failed data. And don't just take it from us; we're hearing this from our customers today, so it's not just Infogix speaking, it's a representation of what we're hearing and responding to in the market. We're working with an international bank with over a hundred billion dollars in assets that is doing reconciliation on its streaming architecture, looking at streaming messages and static data across its use case. We're also working with a large financial institution that is delivering our Kafka capabilities as part of its general data integrity and data quality controls group, more of a traditional checks-and-balances setup. And one of the largest health insurers in the world, a 30-year Infogix customer, is working with us now to plan its Kafka adoption in preparation for an event-driven architecture being implemented over the next 12 to 18 months, because audit is pressing them to have this validation in place and to prove that their Kafka platform is fully validated and assured.

Some key takeaways before we open it up for questions. Organizations are moving toward adopting Kafka; we're seeing it across all industries. Just as they moved from mainframe to a more distributed, database-driven world, we're now seeing a fairly rigorous phase of Kafka adoption across industries. Second, it's the same data quality issues as before: garbage in, garbage out, so you need to make sure the Kafka messages are being validated. Faster delivery and higher data volumes lead to increased data quality issues if they're not managed properly, and it can get out of control. And lastly, our core message: the entire data pipeline, from end to end, must be validated, monitored, and approved against the data quality steps and metrics that meet your company's needs, in order to build that trust and to optimize your streaming data investments. There's a lot of money being invested in these projects; you want them to succeed, and having validation from end to end is one way to get business buy-in and to get clarity and insight into your projects going forward.

You can find out more at Infogix.com; we have several infographics, e-books, data sheets, and blogs you can use as a reference to learn more about Infogix and our positioning on end-to-end data quality. You can also contact me directly; my email is below, and we'd love to chat with you about your challenges in adopting Kafka in your organization or what you're seeing in the market. Okay, I'm going to open it up for questions.

Yeah, thank you. This has been a great presentation.
If you have questions for Jeff, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen. And just a reminder, to answer the most commonly asked question: I will send a follow-up email for this webinar by end of day Monday with links to the slides and the recording of this session. So, diving in here: Jeff, any lessons learned for streaming large structured databases from on-prem to the cloud?

Oh, I'm sorry, I was on mute because I was taking a sip of water. We're actually seeing a lot of attention on that from the customers we talk to, as part of their digital transformation strategy. They're moving a lot of their on-prem data warehouses and core business systems, and we're seeing Azure take a large piece of that market share recently in terms of what the big companies are adopting. So we see a lot of "we've got a data lake on-prem and we're now moving that data to Azure," and they're building in these kinds of checks and balances to make sure that anything that is on-prem actually makes it to the cloud. It's the same left-to-right picture: the producers are on-prem and the consuming downstream systems are on the right, so the same attention goes to what the data looks like in the database and what it looks like once it's posted to the cloud. You tend to start macro and then move into the micro. When I'm moving data, I look at my on-prem database, I have 10,000 transactions today moving to the cloud at the end of the day, or I run a check every hour, and I need to know what has actually been posted to the cloud; that's the more macro perspective. Then, once I know all the transactions were sent in terms of counts and amounts, I dig into the content and apply business logic to make sure nothing was inappropriately transformed or inaccurately aggregated. Those are the kinds of checks being implemented by our customers who have a large on-prem Hadoop data lake and are moving everything to their private cloud or to Azure. I don't know if that answers your question, but if it doesn't, feel free to reach out to me directly.

Yeah, definitely, that's great. And Jeff, if a third party is providing their raw data as files, what's the best way to ingest that data? Are there drawbacks to landing the data in a Kafka topic for cleaning and putting the cleaned data into another topic?

That's a good question. There are quite a few mechanisms and companies out there that will simply move a file into Kafka messages, but they don't have much of a business-logic rules engine in place. When you're taking files, whether on-prem or from a third party, it really depends on the criticality of the data being posted to that consumer. I always like to say marketing campaign data is a lot less critical than financial statements used for regulatory reporting: for the financial report you need every single piece of data validated before it's posted, whereas for a marketing campaign you can be a little more lax about what gets posted to the downstream system. So you can send all of it.
There's also a school of thought that says: send me everything, just let me know what's bad, and I can then work on it or work with the third party. And again, it's about holding that third party accountable. A lot of the customers we talk to say, "We buy data, whether from a financial provider or some type of trading platform, and it's bad, and we have no way of tracking it, but we know we're paying a lot of money for it and we need to hold them accountable." So rather than stopping the data at ingestion with inline validation, where the validation is playing bouncer and only lets the good stuff pass, you let it all pass through but push some of it to the side, perhaps with some redundant storage at that point, and flag it, so you have the data at the ready to feed back to the third-party provider: "The last file you sent me was only 80% complete, here is the proof and here are the values; I want you to go back and fix it." That's the ongoing improvement. If you think back to manufacturing continuous improvement initiatives, you really want to stop bad data or bad occurrences at the source, so that if it happens again you know it will be fixed immediately and won't have an impact downstream.

I think we have time... oh, okay, I was just trying to take myself off mute. I think we have time for one more question here: any insight on handling privacy and security concerns in moving data to the cloud?

Yeah, absolutely. That's a general privacy concern, whether it's GDPR or some of the California regulations coming through, and it will only spread; security and privacy of data are here to stay, and the rules are only going to get stricter. As you move data to the cloud, you want to be able to identify what is PII, what falls under PCI for payment card data, and what is HIPAA data. So you need mechanisms and solutions, and a plug here for Infogix, but you need a tool that can identify what is PII and what is sensitive data. You need to know where it is, who is using it, who owns it, in what manner the people using it are allowed to use it, and whether there is a mechanism for approving that use. There's all of that rigor to put around data going to the cloud. Five years ago, the majority of customers we talked to said they were just dipping their toe in and navigating the cloud space; now, as a vendor, we do not see a single RFP that doesn't ask what our cloud option is. With the structure around security, and in the interest of keeping your company's name out of the news, there is a lot of security that needs to be in place when you're moving data into the cloud, but there are plenty of tools to keep the transmission secure and the data secure at rest. So yes, there are lots of precautions, to say the least.

Well, Jeff, thank you so much for this great presentation, and thanks to Infogix for sponsoring today, but I'm afraid that is all the time we have for today. Just again, a reminder: I will send a follow-up email by end of day Monday for this presentation with links to the recording and the slides. Jeff, thank you so much, and thanks to all of our attendees. Thank you.