Hello and welcome. My name is Shannon Kemp and I'm the Digital Media Manager at DATAVERSITY. We would like to thank you for joining this DATAVERSITY webinar, Data Pipelines Without the Headache: How Accessibility and Affordability Enable Data Success, sponsored today by CData. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We will be collecting questions via the Q&A panel, or if you'd like to tweet, we encourage you to share your questions via Twitter using the hashtag #DATAVERSITY. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just note that Zoom defaults the chat to send to just the panelists; you may absolutely change it to network with everyone. To find the Q&A or chat panels, you can click those icons at the bottom middle of your screen. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar.

Now let me introduce our speakers for today, Kim Coluba and Matt Springfield. Kim is the Director of Product Marketing at CData and is a data enthusiast. She has been devoted to the data management marketplace for over 25 years, with a focus on helping businesses derive the most benefit from their most valuable asset: their organization's data. Matt is a technical engineer at CData who loves to understand and explain technology. After starting his career as a software engineer, Matt gravitated toward the more creative roles that orbit around technology: writing articles and blogs, making tutorials and videos, and hosting webinars. He strives to balance the ability to speak the language of engineers with an understanding of the needs of business professionals. And with that, I will hand it over to Kim and Matt to get today's webinar started.

Hello and welcome. Hello Shannon, and thank you very much for the warm introduction, and welcome to all of our participants on today's webinar. Once again, I'm Kim Coluba, and we're going to do things a little bit differently here today to spice things up as you're participating in the webinar. We're going to approach it in an interview style, and Matt, as you heard from Shannon's great description of his background, is our subject matter expert on data pipelines. So I'm going to be asking him some questions, and we're going to share some slides with you as they relate to those questions. Then we're going to move into a demonstration and, of course, wrap up the session with any questions you might have from the participant side. So, without further ado, let's dive into this session around data pipelines without the headaches. And I'd like to ask Matt the first question to set the foundation. Matt, what exactly is a data pipeline?

Yeah, it's a great question to start off with, Kim, and thank you so much to Shannon again for that introduction, and to you, Kim, for being on here with me today. Welcome to our audience. Let's kick things off by speaking broadly about what it is we're talking about and why it's worth talking about. What is a data pipeline? Broadly speaking, data pipelines are a category of tools that facilitate data movement within your organization's data environment, and we can start to unpack that a little bit by understanding what we mean by data environment.
And that is just all of the different platforms and applications, maybe databases and so on, that generate data, or aggregate data, or back up data, or analyze and report on data. That data isn't always generated in the same elements of your data ecosystem, your data environment, where it needs to be used, and that introduces this need for data movement. So when we talk about data pipelines, we're talking about the category of tools that addresses data movement for this kind of business need.

That's pretty interesting, so my next question on that: are you saying that data pipelines are the same as traditional ETL-type processes or applications?

Well, in one sense it is true that traditional ETL platforms and systems do fall within this broader category of data pipelines, but recently there's been somewhat of a push to reimagine what data pipelines can be. Most of the motivation behind that is the fact that traditional ETL processes tend to involve a somewhat gratuitous amount of complexity. So when we invoke the phrase data pipelines, rather than ETL system or some other phrase, we're usually talking about that more modern push, that modern change in philosophy, that tries to get data movement away from the complexity and the cost of traditional ETL.

Okay, well, that certainly makes a lot of sense. So, given that, tell me what has led to the surging popularity of data pipelines.

Yeah, so data pipelines are a bit hot right now, and it's worth understanding why. I think it can be broken down into two component parts. The first I would call the data ecosystem explosion, and the second I might call the value of data. If we start with that first one, the data ecosystem explosion: the thing to understand here is that recently, especially in the last 10 years, the number of systems and platforms that organizations are using to generate and house and analyze and back up data has just absolutely exploded. It's estimated today that the average enterprise company uses over 200 different SaaS applications. And that's not even counting the legacy databases or other storage solutions like Excel sheets that inevitably we're all still using, much as we might like to think that we're not. So we have this drastic increase in the complexity and size of the various elements of your technical infrastructure that deal with data. Again, that's one half of it: the data ecosystem explosion.

And the other half that combines with that is an understanding of the value of data. I've heard this described as the data revolution. It's a bit dramatic, but it communicates the point, which is that in the modern world it's pretty well understood that gleaning business insights from your data and making data-driven decisions is one of the most important ways to stay competitive in today's industry. Data-driven decision making is how you make sure that your business insights correspond with the bottom line; it's how you make sure that you're putting your efforts in the right place and seeing tangible results based on the emphasis that you're placing on various elements of your business. And if you're not doing it, you can be sure all of your competitors are. So using the value of your data to understand where in your business you need to focus your efforts, that is one of the most important ways to maintain your competitive advantage today.
Excellent. So we've talked about the why; I'd like to talk about the how. How do data pipelines address those two components you were talking about, the data ecosystem explosion and the value of data?

We can combine them together here and we see the natural role for data pipelines. After all, to be more specific and practical, the value of data is really only worth something if the data is both comprehensive and transparent. It's not particularly good to have opaque data that you can't look into and understand, and it's not particularly good to have isolated or fragmented or siloed data. To understand why, we could imagine a more specific example, like data about a customer. Maybe you have some information about a customer's interactions with your sales team, and a record of the purchases and various other things that customer does in an accounts table, but what would really be helpful is to get a 360-degree view of that customer's journey. You want to be able to relate specific marketing efforts to the interest they drive on the customer side. You want to see whether a customer's interaction with your support team, say, addressed their concerns, changed their mind, maybe made a sale. Combining the various discrete pieces of data that describe a customer allows you to get this comprehensive view of that customer's journey and much more valuable insight into how your customers behave, and maybe where you need to focus your efforts in the future.

But of course, the problem here is that in order to achieve this comprehensive and transparent data, data frequently needs to be moved from where it's generated to somewhere it can be aggregated and analyzed. It's unlikely that the place your data is generated is also exactly where it needs to be in your ecosystem to bring the most value. So the role of the data pipeline is to facilitate data movement between where your data is and where you want it to be.

Okay, so do both data pipelines and traditional ETL tools address that need for data movement equally?

In a certain sense you could maybe answer yes, in that both ETL and more modern data pipelines are designed to address the same kind of need: the need for data movement. But maybe it's not quite right to therefore call them equal. The big differentiator, at least in my mind, is overhead. The amount of time and effort involved in building and deploying a full-scale ETL solution can be a pretty big barrier to actually accomplishing the desired data movement and unlocking the value of your data. In fact, much of the motivation toward more modern data pipelines is the desire to avoid a situation where some sort of overhead, whether technical or financial, means that the people in your organization who work with and analyze data, the people who really care about your data, have to sit around and not accomplish much.

Okay, I think I see what you're saying, and I think this visual helps a little bit. I'm viewing it as traditional ETL needing a crew of people to design how that data is going to move, bring in expertise, and make sure they have the right equipment and the right processing in place, while the individuals who need to use the data have to wait for that process to be done.
Is that what you're trying to say, is that what we're understanding here?

Yeah, that's pretty much it. You really don't want to demand too much of your data citizens or data analysts, or whatever title they have in your organization, and there are multiple ways you could be demanding too much of these professionals. One of the main ways is time: sitting around while the construction crew or the engineering crew gets to work. Another is specialized skills or training. If you're saying, well, we don't want our data citizens to wait around for our technical team, we want them to do it themselves, then you might be asking too much of a group of people whose job is not really to be a construction crew, but rather to work with and get their hands on your data.

All right, so with that said, who in organizations tends to be responsible for building and maintaining data pipelines and data integration pieces? Tell me how that works.

Yeah, so we can again distinguish between traditional ETL systems and more modern data pipelines, and I guess we can start with ETL here. The differences in complexity between these ETL processes and more modern data pipelines mean that the answer is different, because traditional ETL tools typically require dedicated technical specialists, and very frequently they need to be not just technically trained but also familiar with the specific ETL platform you're using in order to effectively and efficiently build and maintain these ETL processes. So depending on the size and scope of your ETL needs and the size of your organization, that might be a fully dedicated team that needs to spend their days building out these solutions, or maybe it's a few specialized individuals within your IT team or your technical engineering team. But regardless of how exactly it's composed, it's going to be somewhat of a burden on your technical resources.

Okay, that makes perfect sense. So on the ETL side it sounds like it's more of a team effort, with specialized skills, to really put together an effective data movement process. So what about a modern data pipeline? Who takes charge of those?

Right, yeah, so this is the second half of the answer, which is that more modern data movement philosophies emphasize democratization and agility. Because the emphasis is on data democratization, which of course means giving access to many different interested parties, the question can have a slightly different answer across different companies; there's no one set way that you have to do things. But very frequently, maybe as a through line to identify here, there exists a set of citizen analysts or data scientists who probably aren't as technically trained as your IT team or your engineering team, but who have a significant interest in being able to access and view data in the platforms they're familiar with, maybe a BI and analytics tool, maybe a reporting tool. And it's often these data gurus who, in the context of a modern data pipeline, own the process of building and using the data pipelines. But again, since the emphasis here is mostly on data democratization, it really could be a wide range of teams within your organization that are empowered to control your data pipelines.

A key phrase you said there: data democratization. I know that's really big out there in the market space.
Maybe we could take a few minutes and talk about who in the organization benefits most from this data democratization concept. Is it the data scientists you were describing, is it the data engineer, is it the data citizen? Give me some ideas about who benefits the most from that methodology.

Right, and it's two groups of people primarily. You could stretch it and find some more, but I think the most relevant answer here is two groups of people. The first is these data scientists we were just talking about, or data analysts, data gurus, whatever they're referred to as within your organization. For them it's primarily an issue of access and convenience. If we reflect on waiting around for a construction crew again: your data scientists don't want to have to ping the IT team anytime a new data pipeline needs to be created or something needs adjusting, and they don't want to have to wait around for the creation of that data pipeline in the first place. So giving these data scientists the access to the tools and the ability to create and maintain their own data movement pipelines really empowers them to work at their own pace and be self-sufficient.

Wonderful, so that way they can be more self-service, they can take care of their own needs, and they don't have to spend a lot of time waiting around; in essence, as this picture shows, they can create their own pipelines. You did mention something interesting, though, about IT. Tell me how the IT specialists benefit from empowering their data scientists or their data communities to become more data self-sufficient.

Right, yes, and once again this is the second half of the original answer: not just the data scientists and data gurus, but also the other group that benefits, which is the technical professionals, the IT specialists. It's simply a matter of freeing up time and energy to accomplish other things. If you have a finger on the pulse of the technical or engineering team at your organization, you'll know they have plenty of things to do. If you ask your technical teams to also manage your ETL solution, or your data pipelines, on top of everything else, you can pretty easily see the risk of overburdening your technical resources as you're trying to get your data moved around.

Right, yeah, I definitely can see that, and how it also really relates to data latency. So I'd like to pivot a little bit and talk about another important topic that companies are trying to deal with today, and that is the cloud/on-premises divide, right? Can you explain to me what some of the important elements are for a modern data ecosystem to help with that cloud/on-premises divide, and can you help us understand exactly what that is?

Yeah, so this is always going to come up; it's a very important topic for any kind of data movement conversation, as more and more of your integral platforms and your SaaS applications, your services and so on, are moving to the cloud. So it's much more likely that your organization's data environment, and as we described in the beginning that's basically just every element in your infrastructure that generates or deals with or cares about data, is partly in the cloud, and it's likely that some part is not. So you get this hybrid approach of cloud- and premises-hosted systems, and if that describes your environment,
it's going to come up as a topic anytime you want to move data around between these various elements.

So to give us an understanding: how common is this hybrid data environment that you're describing here today?

Yeah, so it's hard to know exactly, but the best research figures that about 91% of businesses use a public cloud, so this is something like AWS or EC2, something that you're hosting and installing on a publicly accessible cloud, but only about 25% of businesses are purely in the cloud. So once again: 91% of businesses are in the cloud, but only 25% are purely in the cloud. That leaves a pretty significant majority of businesses who need to be concerned about this kind of cloud/on-premises divide when they want to move their data around.

Wow, that's a pretty big number. I didn't realize that such a vast majority of companies, 91%, were using a public cloud. So with that said, what makes this cloud/on-premises divide, and hybrid data environments, an issue that organizations should really be concerned about or considering as they approach this infrastructure?

Sure. So cloud-hosted data storage platforms can be accessed from the public internet, and we're going to set this up as an important distinction here between cloud and on-prem. Obviously you need to be able to access and authenticate against a cloud platform, you need credentials, and there are maybe some other network security mechanisms in place, so it's not that your data is vulnerable to the public. But conceptually speaking, data in the cloud can be accessed by anything that's given permission, that's given the right set of parameters to connect.

That's in contrast to on-premises systems and on-premises data. What we mean by on-premises is that it's hosted internally, on a local machine, on the company network. So even if you had, in theory, the username and password to access a SQL Server instance, let's say, you'd still need to be within that network; you need this physical proximity to the systems you're trying to access to even begin to connect. Most company networks have strict firewall protections in place, and you don't want to treat your company network as if it's on the public internet, because that's a network security nightmare. So the fact just remains that systems on-prem are going to be conceptually harder to access than systems in the cloud. Things in the cloud might not have access to your on-premises data, and no matter how much magic you want to program into your cloud-hosted services, there's this technical limitation there.

Oh wow. Okay, so back to data pipelines. How does a pure cloud pipeline handle that hybrid data environment you were just describing, and the differences in accessibility?

Yeah, so if your data pipeline is a pure cloud service, then unfortunately the answer is that it's not going to handle it particularly well. There are a few demanding and technically sophisticated workarounds, like using agents and installing them on your local network and things like that. But generally speaking, there just isn't that much you can do from a cloud perspective if you simply don't have the connectivity or the proximity to your local network or to the local machine where your on-premises data warehouses or data storage solutions reside.
As a result, practically speaking, what that means is that if you're going to use a pure cloud data pipeline, it pretty much forces you to adopt a total migration to the cloud. And that sounds pretty good; most of us might dream of one day having a pure, total migration to the cloud. But through a pragmatic lens, it's often more valuable for companies to still be able to use their legacy and backup data management solutions even as they slowly, over time, adopt cloud-hosted services. Simply ripping out the ability to support your on-premises systems can be a huge liability for companies looking to move and manage their data effectively.

Very interesting. Those are really some good points to think on; thank you for sharing that. I'd like to pivot just a little bit, because I'd like to talk about scalability. Help me understand: why is it important that data pipelines are scalable?

Right, so to understand that, it's helpful to reflect on the core point of data pipelines, and that is again to ensure the visibility and the comprehensiveness of your data. That's when your data is most valuable: when it's visible and comprehensive. But as your business grows, your data sets become more complex and your data environment grows. You add new services; you might add a new, sophisticated accounting system, for example, or you might build in an automated marketing platform to augment your sales pipeline. And in order to continue gleaning insights from your data, your data movement approach, which is to say your data pipeline, needs to be able to flex and bend and grow with your data ecosystem. With a static or inflexible data pipeline, you'll start to lose the value of your data and your data movement as your ecosystem scales up.

I see what you're saying. So as your data environment continues to grow, or your needs for data continue to expand, you've got to be able to expand and bridge those business requirements together to still provide meaningful insights and analytical processes on the back end. That makes a lot of sense. So talk to me about what makes a data pipeline scalable, or what prevents it from being scalable.

Yeah, once again there are two ways to answer this question, so I'll start with one of them and get to the second. The first is technical scalability, so technical limitations on scalability, and the second is related to licensing and cost limitations on scalability. If we start with the technical framing: a data pipeline is scalable if it has two capabilities. The first is the ability to handle lots and lots of data, a huge volume of data; you don't want the amount of data that you're moving to cause problems for your data pipeline in the future. And the second is the ability to connect to new data sources as your data environment grows. Part of the job of the data pipeline is to establish new connections to the new systems that you're bringing in, so that it can scale up with your environment.

Technical limitations were the first piece. Talk to me about how frequently these technical limitations on scalability really come up.

Yeah, so the good news is that rarely will a data pipeline explicitly rate-limit your data throughput or break as the volume of data gets large.
So that volume of data typically isn't really the concern when it comes to technical scalability; rather, the concern worth having in mind is that second part, which is centered around connectivity. Like I said, in order for your data pipeline to service every element of your ecosystem, it needs to be able to establish connections to each new SaaS application and data storage solution and analytics platform that you want to add. A data pipeline that might initially satisfy your requirements becomes unscalable, unusable even, when you add new systems or expand your data ecosystem, if that data pipeline isn't able to connect to those new systems. Then you've got more and more siloed data, and data falling through the cracks of your data movement solution.

Gotcha, it's like what this picture is showing, where this person is trying to build a new data pipeline but is apparently struggling with some of the connectivity required to access new data sources, so now they can't quite get the robustness or the scalability needed for the data pipeline when they're trying to access a new type of information. That makes a lot of sense; I can see how connectivity is a key component of successful data pipelines. The second one you mentioned was licensing limitations. Share with me what the thoughts are on that.

Right. So beyond just the technical capabilities of your data pipeline, when we're talking about scalability, very frequently, explicitly or not, we're talking about cost. And so we have to understand the way that data pipelines and data movement solutions think about cost and licensing. Some data pipelines will charge you for the volume of data that goes through the pipeline, and this is called usage-based licensing. And while in one sense it seems reasonable that the amount of data moving through the pipeline is what you're charged on, the problem is that it's sometimes hard to know how your licensing costs are going to grow as your data needs scale up in size.

So are you saying that you might be faced with a growing cost of business that you can't predict with these usage-based licenses? Is that what we're saying?

Right, yeah. You might find yourself in a position where the initial cost for the data pipeline made sense, but as your data environment grew and your business scaled up, the cost became prohibitive, or at least inefficient, and now you're stuck with the sunk cost of the time and energy you invested in a particular data pipeline. So it's hard to get out from under that decision once you realize the licensing scalability just wasn't there.

I understand. So as the data going through the pipe gets larger, the cost becomes unpredictable, or not able to be budgeted, because it's starting to increase and it's hard to have insight into that. You could potentially have money flowing out the window because of the inability to have predictable pricing. So with that said, what would be the ideal licensing approach, one that's more scalable than the usage-based licensing challenges we were just talking about?

Right, yeah. Usage-based licensing isn't the only licensing model for data movement solutions, which is good news, because a more scalable licensing model is called source-based licensing.
And that's where the cost of your data pipeline doesn't depend on the volume of data going through the pipe, but rather on the number of individual platforms and services that you need to move data to and from. So now it's not the width of the pipe or the volume through the pipe, but just the number of endpoints that the pipe is connecting.

That makes sense. So what makes this approach more scalable, then?

Well, as your data environment grows, your licensing costs will still probably grow, which makes some sense, but the critical difference is that now you have a lot more visibility into the rate at which the cost of your data pipeline will grow as your business scales up. With usage-based licensing, it's hard to get a sense of how much data volume you're adding; who knows the specific throughput of their data movement off the top of their head? But when you're dealing with source-based licensing, you can easily know the cost of adding some extra element to your data ecosystem. You always know how many systems you're planning on introducing to your ecosystem, and you always know the cost associated with any given new endpoint; if you know you're bringing on, say, two new applications next quarter, you know what that means for your licensing bill before you ever sign up for them. So you can easily forecast and budget for the increasing cost of your data pipeline based on how many data sources you need as your data environment scales up.

Excellent, that makes a lot of sense. So you don't have to worry about the width of the data going through the pipe; you only really have to be conscious of how many connections you want to feed into the pipe, which definitely makes for easier budgeting and predictable pricing.

Exactly. It's not that you're never having to scale up the cost, but rather that it's very clear, visible, and understandable how that scaling is going to go.

That's nice.

Right. And I think at this point, Kim, I might have talked at length about some of the details of data pipelines in contrast to ETL. I do want to flip the format around here and, if you don't mind, ask you a question, which is: we've been talking about data pipelines a bit in the abstract, but of course we want to know how CData approaches this data pipeline issue. Is there a particular way that CData provides that data movement, in light of what we've said?

Well, I'm so glad you asked me that, Matt, thank you so much. Being the product marketer for Sync: yes, we definitely have a way to support data pipelines, and we do that through our CData Sync technology offering. We do this because we understand that modern data users need a reliable way to derive and execute decisions, and they don't want to have to wait around to get the data they need; when they need it, they need it at their fingertips. So CData addresses the demands of the contemporary data community for a variety of reasons. One, we install effortlessly, with no lengthy IT processes and no expensive implementation costs; it's a very simple implementation and integration component. Our technology is easy to use, so you don't have to have special coding or specialized skills to be able to use the application, and we appeal to a wide variety of data citizens out there. Another key component that we provide is real time.
Yes, I said real-time access to whatever data you need, wherever it lives: whether it's in the cloud, like a Snowflake system, Salesforce, MongoDB, or Cassandra; or in an on-premises or legacy environment like Teradata, SQL Server, or Oracle; or you have data lakes out there like Hadoop or Azure Data Lake, plus more; or even event-based systems like Kafka. So what we help you do is conquer that cloud/on-premises divide that we were talking about earlier with Matt.

In addition, we secure the data movement across the environment. Because we land no data, it's merely a simple transfer from one application to the system you want to work with, and we encrypt all of that data while it's in transit, so you can feel comfortable that the information going across the environment is secure. It allows you to sleep at night without having to worry, did I have data leakage someplace? We're making sure that we're keeping that information secure.

In addition, we have over 200 fully managed connectors, in that we're managing all of the connector nuances that exist across the disparate code requirements for each individual application. Which means we're taking all of that connector maintenance off of your plate. You don't have to worry, oh my gosh, I have to change an API; oh my gosh, I have to go under the hood. No, you don't have to do that with us; we provide fully managed connectors, which alleviates a lot of the maintenance headaches.

We also provide flexible deployment. If you want to install Sync on-site, behind your firewall, you can; you can install it in the cloud if you so choose; and you can use your data regardless of where that information is, whether it's on-premises or in the cloud, regardless of your deployment method. That way we're meeting a variety of different business requirements. And then, once again, we talked a little bit about source-based licensing. That's how CData approaches our data pipeline: we only charge for the number of sources or applications being leveraged, rather than the usage-based licensing we were describing before. Hence we provide that predictable, budgetable pricing model that many companies are looking for today, so there are no surprises when the bill comes at the end of the month.

So that's enough about CData and the great and wonderful things it does, but I would like to ask Matt: would you mind showing the audience what CData Sync looks like, and how we approach data pipelines and that cloud/on-premises divide we were talking about earlier?

Sure, Kim. Yeah, you gave a great overview there of the way CData Sync is designed to address a lot of the concepts we've talked about today, but there really is no substitute for simply seeing what Sync looks like, what it can do, and what it's like to work with. So we are going to jump into a live demo here, where I will configure CData Sync in front of you; I've got a fresh install of it, to start from scratch. But before we jump right into the view of Sync, I did want to briefly give you an overview of the kind of business use case we are demoing for you today, the kind of setup we want to have in today's demo. To start, it should be no surprise that we have CData Sync here; this is going to be the star of the show, and we'll spend most of our time within CData Sync doing our good work.
First, we have Salesforce data. In this example I'm going to be referencing the accounts table within Salesforce, but of course it really could be anything in your CRM or your sales enablement platform that is relevant for your business insights, whatever it is that generates data you want to be able to analyze, understand, back up, aggregate, and so on. In this case we're going to use our Salesforce cloud as a sample starting point for the information that we care about. And to reflect how a lot of businesses operate, it's not just the data in the cloud that we care about; we also have an on-premises system, and in this case, as an example, we're going to use a local SQL Server instance that I have running on my personal machine. So it's going to be sitting here not in the cloud at all, but rather on localhost. These are the sources of the data that we're interested in for this relatively simple setup: Salesforce and SQL Server data.

Now the question is, where do we want to move it to? After all, CData Sync is a data movement platform, it's a pipeline, so where's the endpoint of our pipe? For our example today we're going to use Snowflake as our endpoint; we know this is a popular and common data warehouse. And we want to be able to copy and aggregate both our Salesforce (cloud-hosted) and our SQL Server (on-premises) data into our aggregation point here in Snowflake. So the final setup for our demo use case is using CData Sync as the middle point, the data movement, the pipeline, between where our data is, in Salesforce and SQL Server, and where we want it, in Snowflake.

So with that said, I'm going to jump into a live view of CData Sync. Hopefully you can see Sync here. The first thing to notice is that this is a web application; I'm about to sign in to the web portal. This is running on my local machine; you can see there it's localhost. There are cloud-hosted options as well, but of course part of what we want here is the ability to access my local SQL Server, so I'm using the installed-on-your-network version of Sync. I'll go ahead and sign in, and you can see the web portal for Sync here.

First of all, you're seeing a dashboard that shows the usage of Sync. Like I said, this is a fresh install, so it's a little bit boring at the moment; we want to learn how to immediately jump in and start working with Sync. So we want to understand the two-part workflow, the two-part process, for establishing these data pipelines, and those two parts are, first, setting up connections, and then setting up jobs; there are tabs for each up here at the top. We'll start with connections.

All right, so the first thing we need to do is talk to our data sources and our data destinations. We have no connections configured yet, so we just see that the only option is to add a connection, and we see a list of data sources that we can connect to. Now, the list is okay, but it might seem a little bit small if we're trying to capture every different element of your data environment. So it's worth noticing that while these are popular options that come pre-installed with the application, we have a much more comprehensive list here, which you can access very easily by simply clicking and downloading and installing. We have to balance the ability to show you what you need against filling up your entire webpage with all of these icons.
So if you're curious about why your particular data source is not showing up on the screen, it's very likely that it's contained within this much larger list here that I'm looking at. For right now, we care about Salesforce and SQL Server, and these are both simply available in the main menu. So that's what we're going to do: we're going to create a Salesforce connection.

All right, so here I have the Salesforce connection parameters, and you'll notice it's pretty small; there's not a whole lot to fill in here. The reason is that we use OAuth as an authentication scheme with certain SaaS platforms like Salesforce. If you're not familiar with OAuth, it's an approach to authenticating where we don't enter our credentials into CData Sync directly; we never tell Sync exactly what our credentials are. Rather, we have CData Sync redirect us to Salesforce, to the login portal essentially, and we enter our credentials there, which feels essentially the same as logging in to the web portal for Salesforce: we see a Salesforce screen, we enter our Salesforce credentials. Once we've done that, Salesforce redirects us back to Sync, but with a little token that says, hey, you are who you say you are, you're allowed access to this Salesforce instance.

So let's see that play out in real time. I'm going to hit this Connect to Salesforce button, and we'll get redirected to Salesforce. We can see this is the regular Salesforce portal. It's got my credentials already filled in, although for some reason Chrome's autofill doesn't actually work for this, maybe because I have fairly different usernames, so I'll copy this in and log in. So there, behind the scenes, it didn't look like much was happening, but Salesforce accepted the credentials that I put into the portal and redirected back here to Sync, and we see this success message, indicating that CData Sync now has that little token showing we have authenticated directly with Salesforce, and now Sync has access to it. We can see, first of all, that this button has now changed to Disconnect from Salesforce, if we want to terminate this. But for right now, this means our connection is working, we can talk to Salesforce, and I'll go ahead and save those changes. So the first step is done: connecting to Salesforce. Hopefully it was clear that that was essentially the same process as logging into the web portal; not particularly demanding.

The next thing we want to do: we don't just want data from Salesforce, we also want data from the local SQL Server instance that I have running on my machine. So the next connection to add is again a source, and it's SQL Server. The SQL Server connection doesn't have an OAuth flow like that, because it's not a web service that you redirect to and from, but it does require a simple set of properties to connect and authenticate. So there's going to be a brief lull here as I copy and paste the appropriate properties from off screen. Hopefully you can see from the names and labels of these boxes that this is fairly simple information: here's the schema that I'm using, a simple set of user/password credentials, and the web server, or not the web server, excuse me, the machine, where this is running, which as I mentioned is my localhost. Now, I happen to know that since I'm using this local connection with my Windows authentication user, I need to go briefly into the advanced settings to set the auth scheme to NTLM, which I'm familiar with.
But if you're curious about these various settings and where their values might come from within your setup, it's worth pointing out that this Online Documentation button will take you directly to a page where you can understand each of these properties, and maybe where you might find them if you don't already know what the values should be. So, unless I've messed something up, with all of this configured I should be able to connect, and there we go. This is just, again, talking to my local SQL Server instance on this same machine. And that's all we need to talk to SQL Server. We've got our two source connections going: Salesforce in the cloud, SQL Server on-prem. Again, hopefully it's clear that I haven't had to do anything particularly complicated yet.

The next thing to ask is, well, where's the final connection, where's the destination for our data? And that, of course, in our use case is Snowflake. So I'll head over to the Destinations tab and find Snowflake, which is one of our cloud-hosted destinations. All right, so once again it's not an OAuth flow, and I will take a second here to copy and paste some simple connection parameters from off screen. We just need to know where the database is and what schema we're using within it, some basic user/password information, and some more narrowing down of what exactly we're connecting to with the warehouse and the schema. And we can test this connection. Awesome.

Before we exit this screen, I do want to actually show you this Snowflake instance. After all, this is where we're moving our data to, so you can see we really are moving data in real time; this is live. Nothing particularly exciting is going on here; this is an empty database that I've called Matt's DB. This is where we want our data to end up. In an actual use case, of course, this would be more sophisticated, but for the purposes of a relatively simple demo, we can see that there's currently nothing in our Snowflake instance, at least not exactly where we're telling the data to go, because, again, this is where we're targeting. All right, so we've already tested this, but I'll just make sure I didn't miss anything, and we can save our changes.

So, three connections. I don't know how long that took, five minutes maybe, and I talked for a bit, so maybe longer than that, and we're talking to every element of our data ecosystem that we care about: a source connection to Salesforce in the cloud, SQL Server on-prem, and of course Snowflake as a destination in the cloud. That wraps up the first half of the two-step process for moving data with CData Sync; that, again, was the connections, and now we move on to the jobs. We've got our three connections, those are all working, we saw those success messages, and we can transition here into the jobs.

All right, so again there are no configured jobs; this is a fresh install, or at least it was an hour ago, so we need to start. Let's understand that jobs means replication jobs: CData Sync is an incremental replication tool. It takes the data from our sources and intelligently replicates it, so that only the new and changed data in our sources gets replicated to our destination. Anytime changes happen in Salesforce or SQL Server, CData Sync will find that out and copy those changes up to Snowflake, in this case. So let's see what it looks like to actually build this job.
All right, we'll give it a name, let's call it J1 for simplicity, and we tell it what connections we're using. In this case it happens to already understand that we want Salesforce as our source and Snowflake as our destination (it helps that we don't have very many other connections configured yet), the replication type we'll just keep as standard, and we can create this job. So there you see that the jobs are built on top of the connections that we built in the first step. Now let's understand how to flesh out this job so it copies the data that we want.

We use this model of tables and columns to represent data objects. If that feels weird to you because you're not much of a SQL person, for instance, you can think of these as objects and properties. When we say there's a data table that has columns, we're also just saying this is a data object that has fields, an object that has properties; it's the same sort of tabular model. So when we say tables here, what we're asking is: what data objects in Salesforce are we looking to replicate into Snowflake? So we'll go here to Add Tables. CData Sync is going to use metadata from Salesforce to understand what the data objects are, and here we see a comprehensive list of everything in Salesforce that you could access. This is Salesforce telling us: here's the kind of data that we have; what data do you want? For this example, we're going to use this nice, handy Account table; it starts with an A, so it's at the top of the list. And here we see we've added a new table to be replicated: the Account table.

Let's understand what that actually looks like, what that actually is, before we go ahead and run the replication job, so I'll click on this Account table that I just added. You can see it's pseudo-represented in SQL with this REPLICATE [Account] statement. What does that actually mean? Well, it means that the Account table in Salesforce has a set of data that we want to incrementally replicate into our destination of Snowflake. We need to know a few things about it, but all of the things we need to know, Sync can automatically detect. You'll notice I'm not actually configuring anything here; I'm just showing you what Sync is handling for us underneath the hood. Of course it needs to know the primary key, or the index, and it needs to know this incremental check column. I do want to take a second to explain this, and you'll see that this is already filled in, because Sync is smart enough to know what it should be for the specified source, which is Salesforce.

What the incremental check column means is this: we don't want to have to replicate all of our data over and over again; that would be a horrendous waste of time and energy. What we want to do is copy our data once, and then, anytime new data comes in or old data is updated, capture that change and replicate it to our target, our destination. And so this is the column, or the property if you think of it as a data object, that Sync knows will tell us how recent a record is in Salesforce. Every time Sync asks for data, it records the timestamp of when it did so. If something is updated, or a new record is created, that is more recent than the last time we got data from Salesforce, Sync is smart enough to understand: oh, that's new data, I need that, I haven't gotten it yet. So it's using this column to make sure it's being smart about its incremental replication.
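To make the incremental check column idea a little more concrete, here is a minimal sketch of the kind of queries it implies. To be clear, this is not the SQL that Sync itself generates; it's an illustration that assumes Salesforce's SystemModstamp last-modified field as the check column, a couple of illustrative Account columns, and a hypothetical changed_rows staging set feeding a Snowflake-style MERGE.

```sql
-- First run: full load of the source table (conceptual sketch only).
SELECT Id, Name, Industry, SystemModstamp
FROM Account;

-- Later runs: pull only records changed since the last successful run,
-- using the incremental check column and the saved run timestamp.
SELECT Id, Name, Industry, SystemModstamp
FROM Account
WHERE SystemModstamp > '2021-06-01 12:00:00';   -- timestamp of the prior run

-- The changed rows (staged here as a hypothetical changed_rows set) are
-- then upserted into the destination table, keyed on the primary key,
-- e.g. with a Snowflake-style MERGE.
MERGE INTO Account AS dest
USING changed_rows AS src
  ON dest.Id = src.Id
WHEN MATCHED THEN UPDATE SET
  dest.Name = src.Name,
  dest.Industry = src.Industry,
  dest.SystemModstamp = src.SystemModstamp
WHEN NOT MATCHED THEN INSERT (Id, Name, Industry, SystemModstamp)
  VALUES (src.Id, src.Name, src.Industry, src.SystemModstamp);
```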
If we go into the column mapping, this is just saying: what's the property name in the source, and what property name should it correspond to in the destination? That's all fine and good for a demo here; I'm just letting you see that if you wanted to finagle this, it's available. So this is the simple setup for a replication of the Account table. I did talk a bit about it, but you'll notice I didn't actually have to configure anything; all of these defaults are auto-detected and good to go if I want to replicate this Account data into Snowflake, so I'll hit OK.

All right, so let's actually run this job. Again, the whole point of this is to get our data into Snowflake, so I'll refresh the page just so you can see that there's nothing funky going on, and let's run this Account replication. So we're establishing a live connection, it's running, we're asking Salesforce for data... and we didn't replicate any accounts? Interesting. Well, I'm not sure why it's reporting zero rows updated, but here we can see the Account table created in Snowflake, which is what we wanted. It's possible that the incremental update isn't detecting anything as newer than what it already has, which would explain the zero, but regardless, the point is that we're getting our new Account table here in Snowflake from our live connection to Salesforce. If I had an extra second to think about it outside the webinar, I could figure out what's going on here.

But for right now, let's move on to the second part of this replication task, which is our Snowflake data, or sorry, our SQL Server data. We have our Salesforce connection and its job; now we want our SQL Server connection in a job. So this will be J2, again for simplicity. We have our local SQL Server instance, we saw I just created that connection, and once again we'll do our standard replication. In this case we want to follow the same process, because this is a table-based replication model, so we'll add our SQL Server tables. Now, this is just a really simple local SQL Server instance, it's not a real database really, so it's got very little data in it and very few tables, but it gets the point across that we're accessing my local SQL Server. So let's pick the Leads table, which I think complements the account data that we're getting from Salesforce. And once again, we can click this and run the replication, so it'll establish that connection to SQL Server and replicate the data. Okay, here we go: one record affected.

That's also a low number, but that just happens to be the way my SQL Server setup is. We can see here this one record in my local SQL Server instance; this is SQL Server Management Studio, in case you haven't seen it. This is that Leads table we were just asking about, and here's the one lead that we have, which of course is me; I'm a very promising lead in this case. So that's the one record that we're capturing with this replication process. And if we go back into the Snowflake instance, we can see the Leads data has been captured there. So once again, we had two different jobs because we have separate connections: one is capturing the Salesforce data, one is capturing the SQL Server data. And that's all you need to know about the work involved in building a data pipeline with CData Sync.
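With both the Salesforce Account data and the SQL Server Leads data landed in the same Snowflake database, the 360-degree view discussed earlier becomes a single query away. Here is a rough sketch of what that could look like; the join key and column names are hypothetical, since they depend on your actual Salesforce and SQL Server schemas.

```sql
-- Hypothetical example: relate the replicated leads to the replicated
-- accounts inside Snowflake. The column names and join key are
-- illustrative only; they depend on your actual source schemas.
SELECT
    a.Name        AS account_name,
    a.Industry    AS industry,
    l.FirstName   AS lead_first_name,
    l.LastName    AS lead_last_name,
    l.CreatedDate AS lead_created
FROM Account AS a
JOIN Leads   AS l
  ON l.Company = a.Name            -- hypothetical join key
ORDER BY a.Name, l.CreatedDate;
```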
Now, hopefully it's fairly clear that all I really did was plug in some connection credentials, pick certain data objects that I cared about, and let the job run. But you might also be asking: well, I don't want to have to go in and manually run replication jobs; I just want to be assured that my data is flowing freely from where it is to where I want it. And so the last thing to mention here is the Schedule tab. This is available, of course, in all of our jobs, and it's where you can enable a scheduler to tell CData Sync on what schedule you want these replication jobs to run. You can get all the way down to the minute, and if you're really fancy and you want to enter a cron expression, of course you can do that. The point here is that CData Sync will use the schedule to run behind the scenes automatically and do this intelligent incremental replication, so that any changes in your source get reflected in your destination as CData Sync hums silently in the background. You don't have to worry about when your data comes in or where it's coming in; as long as you set up these simple connections and jobs, all of your data is going to end up in Snowflake the way that you want it, and you can just treat Snowflake as that central repository where you analyze, report on, or understand your data.

So hopefully, with all of this set up and said, the approach that CData Sync takes to data pipelines is clear: maximize the simplicity of setting up these connections and jobs, and then just let it hum in the background. So, Kim, I think that pretty much wraps it up for a demo of CData Sync.

Great, thank you, Matt. That made it very easy to see how a data citizen could take charge of their own data pipelines to get data from one application or environment into another source or business application, Power BI, Excel, whatever else they might want to analyze data in, within a simple environment, and they can schedule that to run as often as they want. And with that, I'm going to turn the presentation back over to you for questions and answers.

Absolutely. Thank you both for this great presentation; I love the interview format, very engaging. We have a lot of questions coming in, so if you have questions for Matt and Kim, feel free to submit them in the Q&A. I will ask the most commonly asked questions, and just to note, I will send a follow-up email for this webinar by end of day Thursday with links to the slides and links to the recording, along with anything else. So, diving in: is it an assumption that your data citizens understand the context of the data supplied by the data pipeline?

So that's a good question, and the answer is yes, but yes is a fine answer to that question, right? Our data citizens' job is to get their hands on the data and to work with it in a way that's valuable. What we really want from our data citizens is the ability to understand and be familiar with data, to be those data gurus, and therefore to understand the context of it, but without having to deal with the technical intricacies; we don't want to ask our data citizens to be engineers or IT specialists. So asking them to have an understanding of the context of the data they're working with to draw insights from, I think that's a perfectly acceptable ask.
Yeah, and to pivot off of that, in reference to what you just saw in the demonstration: you don't need specialized skills to access a data source and a destination, and once you have access you can see what's in the tables or objects of each individual source or destination. That way you can even pick a mapping, or just pick the data elements that you want to come across, without really having to know the ins and outs of a particular data component.

So: is there a, quote unquote, control plane where you see all your pipelines running, how they're performing, and other key aspects for monitoring, alerting, supporting, etc.?

Yeah, I can handle this one. If we drop back into CData Sync here for a second: I didn't spend very much time on this page, so it's understandable if you didn't see much of it, but this is the status page, which contains exactly the things you're asking about, so usage metrics, histories, and logs for running reports and debugging and things like that. So yes, there is this sort of status navigation page; I just breezed past it a little bit because at the time we didn't have anything configured, so it looked kind of silly with nothing on it.

Next: I read that scalability increases the hackability of an API. Is this true of a scalable data pipeline?

I'll confess I'm not precisely sure of the context of this question, but generally speaking, CData Sync uses other systems' APIs to communicate. This is getting a little bit more technical, but when we're talking to Salesforce, we're using the Salesforce API. I can imagine it's true that the more scalable an API is, the easier it is to hack. But I guess the important point to realize is that CData Sync, in this instance, is not the one building and exposing the API; we're the one consuming it. So if there's a vulnerability on the API side, it might be Salesforce's issue, but CData Sync is not vulnerable to that kind of attack.

In addition to that, since we're encrypting all the data that's moving or flowing as part of the pipeline, there's another level of security for that information. And since we're not landing data anywhere, you don't have to worry about it being exposed in some type of third-party or predefined, maybe cloud-based, data system that some applications out there require. Because we're not landing any data, that removes another access point, is what I'm trying to say, for security issues.

Perfect. We've got just a few minutes left, so I'm going to try to get in at least one more question, if not two. Do data citizens need a pipeline, or an access path, or both, in the latter case?

Yeah, it's a great question, and I can certainly imagine cases where an access path is fine, is great even. The value of the data pipeline is in those moments where data isn't accessible in the place that you would like to work with it.
This goes back to both the visibility and the comprehensiveness, right? If you want to aggregate data, because the thing that's relevant to you isn't just specific data in the accounts table, for instance, but also the leads that might have generated those accounts, or various other data about the way your company is handling those accounts, then the ability to aggregate, and therefore get a 360-degree view of particular elements of your data, may be more valuable than just being able to access that data in a fragmentary, isolated way. So that's one answer. Another answer is that it's convenient; it's nice if your data citizens are skilled in a particular tool like, I don't know, Power BI or Tableau or something. If you have a data pipeline, then all of your data, wherever it's generated, just flows through that data pipeline to where your analysts want to use it. Analysts no longer need to care how many different systems you're using or about all of the technical setup of your IT infrastructure; what they care about is that the data is in the platform they want to access it in, and it's just always going to be there, with Sync running in the background.

Love it. But that is bringing us right to the top of the hour here. I know there are so many more questions; I'll get these over to CData, so if you have any additional ones, feel free to throw them into the Q&A there. And again, just a reminder: I will send a follow-up email by end of day Thursday to all registrants with links to the slides and links to the recording. And I know on the next slide there you've got a link to learn more about CData, including, I know there was a question in there about the pricing model and such. Thank you both so much for this great presentation, and thanks to all of our attendees. I hope you all have a great day, and thanks to CData for sponsoring today.

Thank you so much.

Thank you so much. Have a nice day, everyone, and thanks again for everything. Thanks, y'all.