Hello and welcome. My name is Shannon Kemp and I'm the executive editor of Data Diversity. We'd like to thank you for joining today's Data Diversity webinar, "Data Prep: A Key Ingredient for Cloud-Based Analytics," sponsored by IBM. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We'll be collecting questions via the Q&A section in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DataDiversity. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and additional information requested throughout the webinar. Joining us today is Fernando Borda. Fernando is the offering manager with IBM Cloud Data Services. He's a technical leader with over 20 years of experience in the software industry and other computer science-related areas. He has been with IBM since 2005, where he has held several management positions in software development and product management. Throughout his tenure at IBM, he has specialized in information integration technology and, most recently, in cloud services. And with that, I will turn it over to Fernando to get us started.

Hello and welcome. Oh, Fernando, I think you're still on mute there.

Thank you, Shannon, and thanks everyone for joining us today. Today I'm going to be talking to you about the space of data prep, what it means in the context of analytics, and what it means to users in general. So with that, let me proceed. As Shannon correctly described, my name is Fernando Borda, and I'm the offering manager for DataWorks, part of IBM Cloud Data Services. Let me start the conversation by talking about who cares about data preparation and what defines the problem. Data preparation is a subfield of the more general information integration space, with some very specific characteristics that make it different from the kinds of problems we have in the classic data integration space. So let me start by introducing some of the problems that we've seen here. One of the key problems is that data scientists and business analysts today are spending most of their time trying to get data: finding, cleaning, and aggregating that data, and in general getting ready to do their work. So we have highly paid professionals who are spending a lot of time just getting the data ready to do their jobs. They aren't actually spending that time doing the analytics they are paid for; instead they're spending most of their time just finding and cleaning that data. The second part of this problem is that in the classic data integration space, you will find that most of the data is on premise, in systems of record. With the advent of several different technologies like cloud, big data, and IoT systems, data is no longer confined to the walls of the IT department. Most corporate data is still within our data centers, tucked away behind the firewall, but at the same time we are seeing more and more data coming from the cloud in the form of social data, IoT data, and in general data that is born in the cloud.
In order to unlock the potential of that data, what users need is the ability to tap into that data, correlate it, blend it, and combine it with the data from systems of record to be able to produce insights. That's actually driving some of the use cases that data preparation serves. The second thing, and this is correlated to the first point I was making about the amount of time data scientists and line-of-business users are spending, is that part of the reason we see this is that traditional ETL requires skills that aren't available to these line-of-business users. Some of the wait time, some of that 80% or 40% of the time that data scientists and citizen analysts in general are spending, is time spent waiting for IT to fulfill their requirements. So as I'm trying to get data for my reports or my data science work, the classic way ETL works is that you interact with the IT department in order to unlock the data that is somewhere you want to access it. And that takes time. Part of the problem is that this introduces weeks, sometimes even months, of waiting to get that data. The business needs access to the data now, and needs access to it in a self-service way. Finally, while data science is a very promising role within our organizations, the reality is that data scientists are just a fraction of the workers that we have. We cannot rely on data scientists to provide all the analytics in the world. There is a ratio of about 1 to 100 between data scientists and business analysts. We need to unlock those business analysts so they can also produce insights and analytics from the data in a self-service manner. So when you put all these four aspects together, you have sort of a perfect storm that really defines the notion of data preparation. This notion that users, whether they are business users or data scientists, need to be able to self-serve their information requirements, to find the data, to combine, aggregate, and cleanse the data they need for their business, without requiring help from an IT department, is what the problem of data preparation represents. So let me jump over now to something a little more concrete with regard to the use cases that represent the data preparation space. The first one is correlated to one of the quadrants I was referring to when we were talking about the problem space. Users need the ability to get data wherever it is and combine it. For example, if I'm evaluating the results of a marketing campaign, I need to be able to get data from the likes of Facebook and evaluate customer sentiment and what the impact of that customer sentiment is on my sales. So I need to be able to get data from my systems of record, like my sales databases, combine that data with systems like Twitter or Facebook to evaluate customer sentiment, correlate it, blend it together, and load it someplace where I can do analytics (a small sketch of this blending idea follows this paragraph). So that's the first use case: blending data from multiple sources. Of course, the next point here is that while there is more and more data in the cloud, the majority of corporate data still lives behind the firewall in our data centers.
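Before turning to where the data lives, here is a minimal sketch of that blending use case in Python with pandas. The frames, column names, and join key are illustrative stand-ins for a sales extract and a social-sentiment score, not anything produced by the IBM tooling discussed here:

```python
import pandas as pd

# Hypothetical extracts: sales from a system of record, sentiment scored from social data.
sales = pd.DataFrame({
    "campaign_id": [101, 102, 103],
    "amount":      [12000.0, 4500.0, 8800.0],
})
sentiment = pd.DataFrame({
    "campaign_id":   [101, 102, 103],
    "avg_sentiment": [0.62, -0.10, 0.35],   # e.g. scored from Twitter/Facebook posts
})

# Blend the two sources on the shared campaign identifier.
blended = sales.merge(sentiment, on="campaign_id", how="inner")

# Correlate sentiment with sales to see whether positive sentiment tracks revenue.
print(blended[["amount", "avg_sentiment"]].corr())
```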
In order to have successful data preparation, we need the ability to access data wherever it is, whether that data is tucked away behind a firewall or sits in a cloud facility, whether it's in file storage, behind a cloud service, or in some other environment. I need to be able to access the data wherever it is. The third use case is the ability to shape raw data for analytics. When you start putting together data from multiple sources, say, going back to the example I was discussing, getting data from Twitter for a customer sentiment analysis, what you will typically find is that the data comes with different levels of quality and in different formats. It's raw data that, left alone, will not let you produce good analytics. So what customers need is to be able to massage, clean, and shape that data and make sure it is relevant, so that they can produce good analytics. The model here is that if you feed garbage to your analytics, you will produce bad analytics. So in order to produce good, relevant, and insightful analytics, you need to be able to shape the data. What shaping means is that I need to be able to put the data in the right format, assess the quality and relevance of my data, and do this in a dynamic way by looking at the data and iteratively deriving and shaping it. Moving on to the next page: another one of the use cases is simply the ability to load data for analytics. Sometimes data scientists in particular just want to access a corpus of data, a bigger corpus of data, put it into their data warehouse, on the cloud or on-premises, and start doing analytics there, whether with R or other statistical facilities, so they need to be able to load and analyze that data. The next use case is that, within the whole ecosystem of applications and analytics, users demand the ability to integrate what they do with their data with things like application development. So as my data scientists and citizen analysts create workflows to move data from sources to targets, I also need the ability to integrate that within applications and control those load workflows from applications on the web. And finally, and this is related to what we discussed before about using both structured and semi-structured data, there is the ability to tap into data in whatever format it comes, normalize it, and correlate it with data that comes from anywhere, whether that is structured data or data from logs, JSON files, documents, PDF documents, or Excel. Whatever the format is, we need the ability to tap into that data, load it, and start building analytics from it (a rough illustration of that normalization step follows below). So these are the basic use cases that we have within data prep. What we're going to do for the next 15 to 20 minutes is start looking, in practice, at how you map this to some of the offerings that we have within IBM. I want to make this an interactive session about how a data scientist or an analyst would use a data preparation tool to get their job done. Within that, I'm going to do a small, brief demo of our service, DataWorks.
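Before the demo, here is a hedged sketch of what normalizing semi-structured data alongside structured data can look like in Python. The nested records standing in for exported social posts and the small sales table are hypothetical examples, used only to show how `pandas.json_normalize` flattens nested data into the same tabular shape as structured sources:

```python
import pandas as pd

# Hypothetical semi-structured input: nested JSON records, e.g. exported social posts.
posts = [
    {"id": 1, "user": {"name": "alice", "region": "EMEA"}, "text": "love the new product"},
    {"id": 2, "user": {"name": "bob",   "region": "AMER"}, "text": "shipping was slow"},
]

# Flatten the nested structure into ordinary columns (user.name, user.region, ...).
posts_df = pd.json_normalize(posts)

# Structured input from a system of record, already tabular (inline stand-in here).
sales_df = pd.DataFrame({"region": ["EMEA", "AMER"], "revenue": [15000.0, 9000.0]})

# Both sources now share a tabular shape and can be correlated, e.g. by region.
by_region = posts_df.groupby("user.region").size().rename("post_count").reset_index()
print(by_region.merge(sales_df, left_on="user.region", right_on="region"))
```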
I'm not going to talk too much about the product itself, but this is one of the services that we have on the cloud for doing data preparation. Basically, the job of shaping and preparing data is based on a couple of premises. We have the ability to access data wherever it is, whether that data is on-premise or in the cloud, and then load that data into a spreadsheet-like environment where the key ingredient is that I am looking directly at the data. So as a citizen analyst, business analyst, or data scientist, I can load the data and start looking at it, start understanding what the quality of the data is and what all its characteristics are, right there on the spot, shape it until I'm satisfied, and then deliver it wherever I'm going to deliver it. Whether that is a data warehouse on the cloud or a visualization tool such as Watson Analytics, or some of the other tooling out there for data visualization, I deliver the data there. The other aspect of this is that the process of shaping data for productive use is iterative, and when we talk about iterative, we mean a couple of different things. Number one, it is a trial-and-error process, a process that is very dynamic. When I'm shaping my data, I want to be able to read data from multiple sources, find out whether it's the relevant data, and understand whether it's the right data or whether I need to go find additional data to get my job done. Then, once I'm satisfied with what I have accomplished, I can load the data and iterate over it over time, so I can operationalize the execution of that workflow and have something that delivers data to me on an ongoing basis. The other thing, and this slide is slightly technical and product-related, but the key point I wanted to deliver here is that one of the crucial things for successful data preparation is the ability to reach data behind a firewall. One of the key concerns is security, especially when you're talking about data on the cloud: having the ability to reach behind a firewall, to tunnel through firewalls and get the data wherever it is, whether it's on the cloud or in systems on-premise. Let me switch now to the applications here. So, a couple of different renderings of what we call data prep. The original domain of data preparation was targeted more toward visualization and tools that allow you to do visualization. So what I'm going to show you here is one of our products called Watson Analytics, and what I want to show is data preparation embedded within a predictive analytics and visualization tool. Here I'm in my Watson Analytics dashboard, and what I want to do is just upload data. I have connectivity to a number of databases, and here I have one for a demo. What I wanted to show here, and I'm just going to skip this part, is the ability to do data preparation within the context of a product. I must have done something wrong with the setup. But what I'm going to show you next is the exact same thing in the context of a service that is strictly dedicated to data preparation. This is our DataWorks data preparation service. A couple of things.
I mean, we were talking about connectivity. The idea of how things are configured here is that you have connectors that allow you to connect to a number of different sources. What you would have to do as a user is just come in and set up your connectivity to your preferred database or facility on the cloud; you set up a number of parameters. But in the interest of the demo, which is really more concerned with what we want to do in terms of data preparation, I'm going to go ahead and jump in to look at some data that I have already, and load it here. I was talking in one of the earlier examples about analyzing marketing campaign data. I have some data that I've loaded into the database, so I'm going to select that data here and go to refine it. What's interesting here is that now I'm able to visualize the data directly in my workspace. One of the key things with data preparation is that I can start learning more about my data. As you see here, there's a region that gives me some information about the data quality for my data and tells me the number of columns that I have. But more importantly, I can start looking at the data itself on my screen and start understanding what that data looks like, so I can start drilling into it. And one of the things we were talking about before is being able to blend data. So I'm going to go add, look for additional data. I'm going to look for additional sales data, just selecting it here and adding it to my workspace. The next thing I want to do is come in and join this data so that I can do my marketing campaign analysis. So here, I'm going to select the two data sets, join them, specify the key that I want to match on, and do an inner join on this data. And immediately I get that into my environment as a combined data set. Next, I want to start drilling into how the data looks and whether it is the right data. One of the first things that catches my eye here is this column called amount spent. By looking at this drill-down of the quality of the data, I think there is apparently some anomaly here. So I'm going to come in and start looking at the data. Once I bring up the value distribution, I can see that there are a lot of rows that simply have no sales data. So I'm going to go ahead and remove those with a filter. It's going to take a couple of seconds. After that, one thing you can see is that as I work on the data, the quality scores are updated accordingly. And I can do more: for example, I can remove some of the values that are unwanted. As I'm looking at the sales region, I can see that most of my values are in one particular region, and I can do some filtering on that data. The other thing, when we're talking about relevancy and keeping only the right data, is that when we talk about shaping, I can remove some of the columns from my data set. So I'm going to remove some columns here. Then I can continue doing multiple things to the data.
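For readers who prefer to see those demo steps as code, here is a minimal, hedged sketch of equivalent refinements in Python with pandas. The data, the column names ("amount_spent", "sales_region", "internal_notes"), and the SQLite target are illustrative stand-ins for what the DataWorks interface does interactively, not its API:

```python
import sqlite3
import pandas as pd

# Illustrative stand-ins for the two data sets selected in the demo.
campaign = pd.DataFrame({
    "campaign_id": [1, 2, 3, 4],
    "amount_spent": [1000.0, None, 1500.0, None],   # some rows have no spend recorded
    "internal_notes": ["a", "b", "c", "d"],
})
sales = pd.DataFrame({
    "campaign_id": [1, 2, 3, 4],
    "sales_region": ["EMEA", "EMEA", "AMER", "EMEA"],
    "revenue": [5000.0, 700.0, 9000.0, 300.0],
})

# Step 1: blend the two data sets with an inner join on the shared key.
combined = campaign.merge(sales, on="campaign_id", how="inner")

# Step 2: filter out the anomaly -- rows with no spend amount.
combined = combined[combined["amount_spent"].notna()]

# Step 3: keep only the region most of the values fall into (hypothetical filter).
combined = combined[combined["sales_region"] == "EMEA"]

# Step 4: drop columns that are not relevant to the analysis.
combined = combined.drop(columns=["internal_notes"])

# Step 5: deliver the shaped data to a warehouse table (SQLite as a stand-in target).
conn = sqlite3.connect("warehouse.db")
combined.to_sql("campaign_analysis", conn, if_exists="replace", index=False)
conn.close()
```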
But the key thing for me here was to illustrate how, in a dynamic way, a user can come in, look at the data, start assessing what the quality, relevance, and completeness of the data is, process and shape it until they're satisfied, and then deliver it someplace. In this case, what I'm going to do is load this data into a warehouse. I'm going to select a schema and then run it. Hold on, let me give my activity a name. And then I can just run my activity to deliver those values. With that, I had planned to do an additional demo with one of the other products, but as you can see here, what this illustrates is the relevance of data preparation and how it plays into analytics. So I'm going to cut it slightly short here, wrap up, and then we can switch to 10 or 15 minutes of questions. So Shannon, if you can open it up for questions.

Sure. We have a lot of good questions coming through. And of course the most popular question we get is about the slides and the recording. Just a reminder, I will send a follow-up email within two business days with a copy of the slides, the recording, and anything else requested throughout the webinar. I'm just pulling up a question here. There we go. So the first question coming through for you is: how does data prep differ from traditional ETL, or does it?

That's a great question. Traditional ETL and data integration are essentially the realm of IT. That's part of what I wanted to illustrate with the demo: usually when you're doing ETL, it's a more prescriptive development process, where you have somebody who really understands what your data sources are, and they start coding and following, if you will, a recipe for cleansing, transforming, and publishing that data. One of the key differences with data prep, which is a subspace within data integration, is the ability to do that within a self-service environment. Data preparation is really targeted at line-of-business users, the users who do not have the technical skills that your ETL developers would have and who are trying to get at and work directly with the data. Those are users who understand the content of the data but don't necessarily know how to code in an ETL tool. That's essentially the difference, and one of the differentiators of data preparation: this type of user, the business analyst, and sometimes the data scientist, although data scientists are more technical users, spends a lot of time waiting and trying to get access to data and to cleanse that data, and they simply don't have the tooling that lets them do it by themselves without having to go to an IT department to write those ETL jobs to deliver the data so that they can get their work done.

Thank you so much, and I love this next question. Where can I find ballpark pricing for DataWorks? I have a client that could use these capabilities, but the value equation needs to hold; maybe a subscription per user, by data volume, et cetera, would be helpful.

That's a great question. The disclaimer here is that we're not really doing just a product demo; we're having a wider discussion about what data preparation is.
You do have my contact information, and you can reach out to me directly. Otherwise, if you go to bluemix.net, you can look for the DataWorks data preparation service; pricing information is available there, along with how to subscribe to the service. There are a couple of different flavors of how we deliver the product, based on capacity and so on. So pricing is available on Bluemix.

Love it. Now, I think you covered the answer here, but I'm going to ask anyway so maybe you can go over it. This seems to assume you know where the data is. Do you have a data discovery solution as well?

Yes. We just launched DataWorks in the last month. Within the next couple of weeks, we will be introducing some of the data discovery capabilities. That's another one of the key areas within data preparation as well. Some of the functionality users are looking for is not just the ability to get and shape the data; they need the power to find that data somewhere. Our vision for that is to give users the ability to collaborate over data, to describe the data, to tag it, to do a kind of crowdsourcing over the data, so that they can themselves provide that information and help create a better experience of finding data.

So does it also support moving data from cloud sources to on-premise data sources?

Yes, it does. At the moment we have a limited number of sources. Right now we support PDA, PureData for Analytics, for moving from the cloud to on-premise. Over the next few months, we will be adding additional sources and targets and additional connectivity.

Very nice. And as you've been building this, what has your experience been so far? Will line-of-business users have the skills to do this sort of detailed work?

That is the major point around data preparation, really: to provide an environment and the capabilities for line-of-business users to be able to do that. When you look at most of the tools within the data preparation space, they're targeted at that particular use case, and that's definitely the kind of persona we are targeting in the design of our software. Some of the key things there are that you have the data at your fingertips and the tool should help you cleanse and shape that data without requiring those technical skills. Some of the other key things relate to what somebody was asking in the previous question, which was how to find the data. Typically, finding data and doing data discovery requires some level of technical skill. Let's say I am browsing the schema within a database: if I am a business analyst who just wants to do reporting, I would likely have some trouble finding the data, because schemas, tables, and column names are described mostly in IT terms that a DBA would understand.
With some of the collaboration and data discovery tools, the ability would be to let users describe the data in a way that is relevant to a business user, so that they can find the data without having to know and understand the technical specifications of how the data is structured and formatted on the data sources.

And is this the version that leverages Spark?

Yeah, that is interesting. DataWorks, although you don't actually see it anywhere, is backed by Spark. The processing engine behind the DataWorks data preparation utility is Spark. It's one of the first applications we delivered that leverages the power of Spark for doing the data preparation behind the covers.

Sure. And the same person wants to know, can I load data from object storage to cloud sources, or dump data from cloud sources to object storage?

Object storage is currently not supported in the service, but it will be; again, we're talking about a few months' time. So in the next few weeks or months, we should have that.

Sounds good. And how is the source data stored initially? And then once the user selects the source, is it essentially copied?

Okay, so that is an interesting thing. Different data preparation tools behave differently. The approach that we've taken is that the environment in which you're doing the shaping is one where you're working with a sample of the data. While you're doing the dynamic shaping, it is done on Spark in memory. At the end of your interactive data shaping, what you have is something that we call an activity, which is basically a repeatable workflow. That repeatable workflow is designed to read data from whatever sources you're accessing and copy it over to the target. So it's basically a source-to-target data manipulation tool, and until you actually run the activity, no data movement occurs. The user specifies which is the source and which is the target using the tool (a rough sketch of this deferred, repeatable workflow idea appears below).

I love it. Well, the next question is: is there a data catalog internal to the tool that keeps metadata? Is this exposed to the end user to use as a metadata catalog?

Right. I think there was another question that was akin to that one. There is this notion of collaboration and social collaboration, and that is a feature that will be coming: if you look at the tool today, there are some coming-soon features, and one of the features that will be released in the near future is the ability to have a data catalog. The main focus, because we're talking about line-of-business users, is a tool where users can collaborate. When we're talking about metadata, it's really about describing the data. As I find data in my data sources, I can publish that data to the catalog, publish metadata about it, give it a business description, tag it, provide ratings, and add comments. Essentially, it's a tool for users to collaborate, and what that will give users, when we're talking about data discovery, is a better way to search for data. So once you have the data in that catalog, you can look for data without knowing exactly how or where it is stored; you just go directly to the catalog and find it.
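Coming back to the activity answer above: the idea of designing shaping steps on a sample and only moving data when a repeatable workflow is run can be illustrated with a small, hedged Python sketch. Everything here (the `Activity` class, the step functions, the file names) is a hypothetical stand-in for the concept, not the DataWorks implementation:

```python
import pandas as pd

class Activity:
    """A repeatable source-to-target workflow: no data moves until run() is called."""

    def __init__(self, source_path, target_path):
        self.source_path = source_path
        self.target_path = target_path
        self.steps = []                     # shaping steps recorded during design

    def add_step(self, fn):
        self.steps.append(fn)               # e.g. a filter or a column drop
        return self

    def preview(self, n=1000):
        # Interactive shaping works on an in-memory sample only.
        sample = pd.read_csv(self.source_path, nrows=n)
        for step in self.steps:
            sample = step(sample)
        return sample

    def run(self):
        # Only here does the full source-to-target copy actually happen.
        data = pd.read_csv(self.source_path)
        for step in self.steps:
            data = step(data)
        data.to_csv(self.target_path, index=False)

# A tiny stand-in source file so the sketch is self-contained.
pd.DataFrame({"campaign_id": [1, 2, 3],
              "amount_spent": [100.0, None, 250.0],
              "internal_notes": ["a", "b", "c"]}).to_csv("campaign_data.csv", index=False)

# Usage: record the shaping steps, check them on a sample, then operationalize.
activity = (Activity("campaign_data.csv", "shaped_campaign_data.csv")
            .add_step(lambda df: df[df["amount_spent"].notna()])
            .add_step(lambda df: df.drop(columns=["internal_notes"])))
print(activity.preview())
activity.run()
```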
All right. And is there an interface to visualize or query the lineage of data from source to target, so that we can see where a piece of data is coming from?

Not at the moment, but that's definitely one feature we're considering adding in the future.

How will the tool be enhanced to support integration activities like invoking a stored procedure on my cloud data source? Is this on the roadmap?

Yes, it's something that will probably come in the future.

I love it. Well, Fernando, thank you so much for the presentation today. That's all the questions we have. Just a reminder, again, that I'll be sending out a follow-up email within two business days with links to the slides, links to the recording of this session, and anything else requested throughout the webinar. And thank you to all of our attendees, who engage in everything that we do and always ask so many great questions. We really love and appreciate it. I hope everyone has a great day.

All right. Thanks, Shannon, and thanks everyone for attending this session today. Thank you.