Ladies and gentlemen, please welcome co-founder of Databricks, Andy Konwinski. Thank you. Welcome back to the summit. This morning we're going to jump right in. Our first talk is by Reynold Xin, who is the chief architect of Spark at Databricks. Reynold will be talking about real time and Spark. Let's welcome Reynold.

So yesterday you all heard Matei Zaharia, and one thing he said is that we're going to double down on streaming this year. This morning I'm going to share with you some of our plans for the year for real time and streaming. So why real time? The answer is pretty obvious for most of the audience here: it's useful to make decisions faster, and sometimes that leads directly to the bottom line. For example, credit card fraud detection. You want to flag a fraudulent credit card swipe as soon as it happens, rather than ten days later, when you no longer even know where the person making the fraudulent transactions is. There are many other use cases.

In response to this, the industry started a new trend, a new generation of systems called streaming engines. A streaming engine typically takes an input stream, sometimes multiple input streams, and produces an output stream, or maybe multiple output streams. And what is a stream? A stream is really just an infinite list of data that grows over time. About three years ago we saw this trend in Spark, and as a result we tried to design Spark from the ground up to respond to this real-time demand. One example of this is Spark SQL, which can answer queries interactively, very quickly. But the other, more important component is Spark Streaming, which was introduced three years ago in Spark 0.7, probably way before most of you had heard of Spark.
According to a survey we did last summer, half of Spark users think Spark Streaming is the most important component of Spark, so its popularity is certainly on the rise. Spark Streaming was our first attempt at unifying what we call batch computation, that is, computation on static data, and streaming computation, the real-time aspect of it. It has many nice features that were a first of their kind in streaming engines, including built-in state management and exactly-once semantics, which I'll explain in a minute. There are also a lot of features that are very nice for scaling the computation out to very large workloads, such as straggler mitigation, load balancing, and fast fault recovery.

Now, just to touch on exactly-once semantics, what do I mean by that, for those of you less familiar with streaming computation? Given some operation you want to perform in streaming, there are three different guarantees you can offer. The first is what we call at-least-once semantics: you apply the computation at least once for a given event. The second is at-most-once semantics, meaning you apply the operation at most once. So imagine a credit card swipe. Whenever a swipe happens, under at-most-once semantics you charge the user at most once: sometimes you might charge the user, and sometimes you might not charge the user at all. That doesn't sound great. What about at-least-once? It means you charge the user at least once, but you might double charge, you might triple charge. That doesn't sound very nice either. In the at-most-once case you lose money; in the at-least-once case you get angry emails from your users. Now, both of these guarantees are actually really easy to provide; I can give you a solution that requires no work. For at-most-once, just never charge your user. Just return immediately in that computation.
Never run anything. In that case you have at-most-once; technically you're always returning zero, but it is at-most-once. What you really want, typically, is what we call exactly-once: if the user swipes a credit card, you charge the user exactly once.

Since Spark Streaming was introduced about three years ago, we've worked with hundreds of streaming deployments, with a lot of Spark users and customers, and we've learned a lot in this journey. One thing we learned is that streaming computation typically doesn't exist in isolation; it's typically combined with a lot of other kinds of computation. So let's look at one use case again: credit card fraud detection. Typically you have a stream of data that's just credit card swipes, and you want to determine for each swipe whether it's fraudulent or not; it's an anomaly detection problem. Now imagine I'm a credit card company and I've decided to implement this whole system in streaming, using a very simple algorithm: if a credit card transaction happens outside a 15-mile radius of my user's residence, I'll flag it as fraud. Can I get a show of hands: how many of you have had a credit card transaction denied because you were traveling somewhere? Everybody. See? Big data is still not really solved. Now, this model is obviously not very good, but sometimes you'd be surprised how far you can get with a model like this. What really happens is that as soon as you get more user complaints, you want to look into how you can improve the model. How do you improve it? You typically involve some data scientists, analysts, or a modeling team who go and look at historic data to understand the trends and spending patterns better, so you can build better models.
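To make the three delivery guarantees concrete, here is a toy sketch in plain Python (this is not Spark code; every name in it is invented for illustration). Exactly-once can be achieved by remembering which swipe IDs have already been processed, so a redelivered event is never charged twice:

```python
# Conceptual sketch of delivery semantics; all names here are made up.

def charge_at_least_once(swipes, charge):
    # At-least-once: if events are redelivered after a failure,
    # the same swipe may be charged more than once.
    for swipe in swipes:
        charge(swipe)

def charge_exactly_once(swipes, charge, seen=None):
    # Exactly-once: track which swipe ids were already processed
    # and skip duplicates, so retries do not cause double charges.
    seen = set() if seen is None else seen
    for swipe in swipes:
        if swipe["id"] not in seen:
            seen.add(swipe["id"])
            charge(swipe)

# Simulate a redelivery: swipe 1 arrives twice (e.g. after a retry).
charges = []
stream = [{"id": 1, "amount": 30}, {"id": 1, "amount": 30}, {"id": 2, "amount": 5}]
charge_exactly_once(stream, charges.append)
# Each unique swipe is charged exactly once despite the duplicate delivery.
```

The same idea, tracking what has already been committed so that retries become no-ops, is roughly what streaming engines do internally to upgrade at-least-once delivery into exactly-once processing.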
At the same time, you might have a different stream that's updating some machine learning model in real time, to improve the model you're already using in production. And this streaming pipeline might be different from the pipeline doing the fraud detection itself, because for fraud detection you want to flag the fraud as soon as it happens, maybe within milliseconds or hundreds of milliseconds, whereas for training a new model you probably don't want to deploy a new model at hundred-millisecond intervals; at some point you want humans to step in and check whether the model actually makes sense.

So this use case alone involves multiple streams, involves looking at historic data in an interactive fashion, and also involves running a lot of batch computation. And looking across many different use cases, this is not unique. As a matter of fact, a lot of streaming use cases involve many, many non-streaming components. As a result, we decided to give this class of applications a new name: continuous applications. So what is a continuous application? A continuous application is just an end-to-end application that acts on real-time data. It's a very simple definition; with it, you could call almost any application a continuous application. Having studied this pattern, we wanted to look at how Spark can simplify the building of continuous applications. Now, what makes building continuous applications difficult? I think there are two reasons. The first is that it's very difficult, in general, to integrate streaming systems with non-streaming systems, which is typically necessary in continuous applications.
It's difficult to do interactive analysis on live streaming data. It's very hard to run batch computation on historic streaming data with a consistent view of the data. It's very hard to output streaming data into some relational database in a consistent fashion with exactly-once semantics. And it's very hard to do machine learning on streaming data.

The second reason is that the streaming computation model itself is extremely difficult: difficult to understand, difficult to learn, difficult to reason about. Let me give you one simple case; it can't really get simpler than this. We have a stream of web traffic logs. We pipe this stream into a streaming engine, and all we want to do is aggregate a count: I want to see how often each page gets visited at any given moment. Maybe I'm doing some simple analysis after this; maybe I'm charging my clients based on how much traffic I've generated for their website. This seems really simple, so what can go wrong? It turns out a lot of things can go wrong. For example, one of the processes collecting the logs might be slow and send you records that are delayed; some logs might literally show up two days later. In a distributed system, some nodes might be outputting data while other nodes are not functioning and not outputting, so you're getting partial data. The whole pipeline might fail, and when it recovers it might give you inconsistent results. And there are many other problems that can happen. All of this makes the streaming computation model extremely difficult along three different dimensions. The first dimension is data: it's very hard to reason about data in streaming computation, because data might arrive late and have varying distributions over time.
That means an algorithm that works at one particular instant might not work in the future. The second dimension, processing, is also difficult, because over time your logic might change: you might want to apply a new machine learning algorithm, or change your whole algorithm completely. There are also new kinds of operations, such as windowing and sessions. And last but not least, the output is difficult to define: you now have to somehow define what the output means over time, and how you define correctness.

So for the past few years we've been thinking about how we can simplify streaming computation, because we don't want only PhDs in computer science to be able to understand this and build robust streaming applications; we want to take it to the masses. I think we finally have a satisfying answer, and we're calling it Structured Streaming. It comes from one simple realization: the simplest way to perform streaming computation is to not have to reason about streaming. You heard me right. I did drink my coffee this morning; I'm not making this up. The simplest way to perform streaming analytics is to not have to reason about streaming. What do I mean by that? In Spark 1.3, which was released about a year ago, we introduced DataFrames. DataFrames expose a very simple API, similar to the tools people have been using on single nodes. You can use DataFrames interactively, and in batch mode, to deal with historic data. So we thought: what if, in Spark 2.0, we just expanded DataFrames and generalized them to deal with an infinite amount of data, introducing a new concept, a new API, called streaming DataFrames? You could run the same DataFrame operations, just on a stream. A new API in Spark 2.0: streaming DataFrames. But then we thought, nobody really wants that.
It would be so much easier to have a single API, just the existing DataFrames. So that's what we actually did: Structured Streaming. It essentially extends the pre-existing DataFrame API that builds on the Spark SQL engine. With it, you can run the same DataFrame operations over a stream, with new operations such as event-time support, windowing and sessionization, and streaming-specific sources and sinks. This is our second attempt, and I think maybe the best so far, at unifying streaming, interactive, and batch queries, so we can substantially simplify continuous applications. Examples such as aggregating data in a stream and outputting it into some MySQL database should just work out of the box. Or even better, maybe just within Spark you could query directly the state that's being accumulated or aggregated, and you should be able to pipe it directly into Spark's machine learning library and use that on a live stream.

For the next five minutes, fair warning, I'll dive a little deeper into the details, which is probably atypical for a keynote. Those of you who prefer to stay at a slightly higher level, bear with me; I think the developers among you will actually like this. So let me explain the model. First we have time; this is just normal time, nothing special here. And we have the concept of a trigger, which just defines how often we run some DataFrame operation. In this case I have a trigger of every second, so every second something is supposed to happen. Now, from the beginning of time to the first second, my stream has accumulated some data. I've defined a DataFrame query; let's say it's just the simplest possible aggregation, a count: I want to count how many records I have.
This will output basically one record when I run this query on the data accumulated from time zero to time one, and I output all of it, which is a single number, to some output. At time two I've accumulated more data: I have all the data up to time one, plus the data from time one to time two, and I run the same count query again over all the data, get another number, and output it. Same thing at time three, except this time I'm running on all the data from time zero to time three. In this case we output all the data every time. Sometimes you might want to output only a delta instead: for example, if I'm just doing some ETL and new data arrives, maybe I only want to output the new data, not all of it.

So, to recap this very simple model, there are four concepts. There are input streams, or input sources: an input is really just a table that's append-only, so it keeps growing over time toward infinity. There are queries, which are just typical DataFrame or SQL queries, with new operators for windowing and sessionization. There are triggers, which indicate how often a query gets run. And there are output modes; I think we're going to support three of them: complete, which always outputs all the data; delta, which outputs only the differences between triggers; and update-in-place, which updates records directly, provided you have a transactional system in place. Now let me give you a couple more concrete examples to help you understand this. Say you're building a streaming ETL pipeline, which it turns out is one of the most frequent use cases for streaming: to ETL something continuously.
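Here is a tiny plain-Python simulation of the model just described (again, this is not the actual Spark API; the function and variable names are invented): an append-only input table, a count query run at every trigger, and the difference between complete and delta output modes.

```python
# Toy simulation of the Structured Streaming logical model.

def run_trigger(input_table, new_records, previous_count):
    input_table.extend(new_records)        # the input is an append-only table
    count = len(input_table)               # the "query": logically over ALL data so far
    complete_output = count                # complete mode: emit the full result
    delta_output = count - previous_count  # delta mode: emit only what changed
    return count, complete_output, delta_output

table, prev = [], 0
complete_outputs, delta_outputs = [], []
for batch in [["a", "b"], ["c"], ["d", "e", "f"]]:  # data arriving between triggers
    prev, complete, delta = run_trigger(table, batch, prev)
    complete_outputs.append(complete)
    delta_outputs.append(delta)
# complete mode emits the running totals; delta mode emits only the increments
```

Note that although the query is *defined* over all the data accumulated so far, the delta here is computed from the previous result rather than by rescanning everything, which is the same spirit as the incremental execution described later in the talk.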
So I have data landing as JSON in some S3 bucket. The query here is very simple: just a simple transformation that turns JSON into Parquet. For this application I want it to run every five seconds, and for the output mode I only want deltas, only the new records, which I'll write directly back into some S3 bucket. That's how I'd define this job.

The second example, similar to my earlier one, is the page view count: I want to count, for each minute, how many times each page on my website has been visited. The input might be logs generated by my HTTP servers, coming in from Kafka. The query is very simple; I'm using a SQL query here to express it, and it's basically just a count, grouped by page and minute. And the output goes into my MySQL database, so I don't need to query Spark or anything; I can just query my MySQL database. Now, the nice part about this one is that sometimes data arrives late, as I was talking about earlier, and conceptually this should automatically correct my output if I update in place when the late data arrives. So if some data arrives, say, two days later, the MySQL table that contains all the counts gets automatically updated.

Now some of you are suspicious; I can see it on your faces. First I walked up on stage and told you the best, easiest way to do streaming is to not think about streaming, and now I'm telling you we should just run over all the data, all the time: if I want something to happen at time three, I run the query on all the data from time zero to time three. You must think this guy is smoking something. Well, what I've just explained is the logical model, how you should reason about Structured Streaming: you're just applying DataFrame operations to static data, thinking of it as all the data from time zero to whenever the query runs. But when we actually execute it, it's actually very
similar to how DataFrames have always been executed. What this logical programming model gives you is that specifying streaming queries becomes as simple as querying static data. From the operations you express, we generate a logical plan, and Spark, through the Catalyst optimizer, turns this plan into a continuous, incremental query that runs efficiently, rather than actually running through all the data every time.

Let's go back to the aggregation example; let me show you some code. This is how you run an aggregation with DataFrames. For those of you familiar with DataFrames, it's probably self-evident what's going on: we're reading some JSON log data from S3, grouping by, say, user ID, and aggregating, with count for example, or summing up the total time each user has spent on each page, then writing it out to a JDBC database. That's how we express this against static data. Under this new model, the way you express this as a streaming, continuous aggregation is to just replace two of those calls with .stream. Now you can drop new files, new logs, shipped maybe from some other system, into S3, and every time a new file appears we'll pick it up, run it through this pipeline, and update the output in place. It would be bad if, running this query a year from now, we had to go over all the data from now until a year later. So the way it actually works is this: the first time Spark SQL sees a JSON file in the S3 bucket, it computes the aggregate. When a new block of data arrives, Spark SQL computes the aggregate again, except in an incremental fashion: we only aggregate over the latest data and automatically combine it with the historic aggregates, without users having to worry about any of this. And when a new block arrives again, the
same thing will happen, over and over again.

And this is not something we're doing only for streaming. As I said, continuous applications really need more than just streaming, and as a result the rest of Spark will follow and be revamped to support this. For example, we're using the same DataFrame abstraction, so there's only a single API. Second, Spark will introduce a new data source API that can support exactly-once semantics end to end. The reason this is worth highlighting is that, while I think Spark Streaming might have been one of the first streaming systems to support exactly-once semantics, so you don't charge a user fewer or more times than once, it turns out a lot of streaming engines guarantee exactly-once semantics only within the engine itself. As soon as you involve output operations, it's very difficult to guarantee exactly-once: within the engine, where I'm doing all my aggregates, I can guarantee records get processed once, but as soon as I want to write the results out to, say, MySQL or Oracle or anything else, exactly-once breaks down. Here we actually want exactly-once semantics end to end, from beginning to end of the pipeline, and we have output modes that support writing out all the data, just partial data in delta fashion, or updating data in place. The machine learning algorithms will also be updated to support all of this.

So besides a very simple programming model, in which you only need to think about batch data, which you all already know how to do from using the rest of Spark, you get features that are very difficult to find in other engines at the moment. A few are worth highlighting. One: it's very difficult elsewhere to run interactive queries directly on the stream against a consistent snapshot. And the second is dynamically changing
queries. I touched on this earlier when talking about credit card fraud detection: sometimes you want to change your algorithm completely, and you don't want to bring down your entire streaming application, your continuous application, to do it; you want to be able to change queries dynamically. And of course you also get the benefits Spark has always had, elastic scaling, fault tolerance, and all of that, in this new engine. So we keep all the nice perks Spark always had, and we're adding many new features with this much simpler programming model. Really, what we want is that if you ever need to build something like this, or even more complicated, it should be a lot simpler; a lot of things should just work out of the box, without you having to worry about all the gory details of streaming.

I'm going to end the talk with a bit of a timeline. As Matei announced, Spark 2.0 will probably be released at the end of April or the beginning of May. In Spark 2.0 we'll have the API foundations built out, and really there's not a lot that's new from the user's point of view, because it's really just DataFrames; there's nothing new to learn, just a new DataFrame API. The first version will have Kafka data sources, file system data sources and sinks, and also databases, because we thought a lot of you probably have some relational database running that you'd want to integrate with out of the box, with the correct semantics, so you don't have to worry about building that. There will also be support for event-time aggregations. For Spark 2.0.1 and beyond, it's a little hard to predict exactly what each version will have, but we'll put a lot of effort into making this expressible in SQL, so even your analyst friends can use this without typing any code. Then there's built-in BI app integration, plus other streaming sources
and sinks, and machine learning. All right, this is the end of my talk. Thank you.

Michael Armbrust will be giving a talk, I believe one of the first technical sessions of the day, on how Spark 2.0 connects structure with the rest of Spark; I think the title of the talk is "Structuring Spark: DataFrames, Datasets, and Streaming." If you want to know more of the details, I encourage you to check out that talk. Thanks a lot for coming.

Spark 2.0 is very exciting. Our next speaker is from Capital One. It's great to have folks from the enterprise come talk about how Spark is bringing value to their use cases and to the vision of the Summit, yesterday and today. Our next speaker is the VP of technology for big data and credit cards at Capital One. Let's welcome to the stage, to talk about their use case for Spark, Chris D'Agostino.

Good morning. Thank you for the introduction. I'm Chris D'Agostino, and as you mentioned, I'm one of the engineers responsible for Capital One's big data implementation across credit card. Today I want to talk to you a little bit about how we're using data to better serve our customers and build better fraud defenses, and in particular what we're doing with Spark to help enable that. Let me give you some context about Capital One. We've got a lot of customers across a broad range of financial products. Many of you are familiar with our credit card products, but we also have products in banking, mortgage, auto, and commercial. In total, we have over $200 billion in assets and deposits, and we've got 72 million customers that use our products each year. That makes us one of the top 10 US banks; you can see a list of some of our competitors here. We were founded in 1988, we went public in 1994, and in fact we're the only top-10 bank founded in the last 100 years. So let's talk a little bit about what we do from a digital standpoint. We have inherently digital products.
From the transactions done online and with our mobile products, as well as card-present and EMV card swipes, we've got a ton of data coming into the company, and we have a range of both native mobile and responsive web applications to support it. You can see here that our customers predominantly contact us through digital channels; in fact, 75% of our customers interface with us through these. But the other channels are important from a fraud defense standpoint, because it's those applications and systems, traditionally siloed, that are points of attack for fraud rings. So if you take our customer transaction data, our servicing touch points, the instrumentation we have in our applications for clickstream data, add in our new credit card applications, which total in the thousands per day, and then add the third-party data sources and feeds we use to process customer information and build our models against, we quickly scale up to petabytes of data.

So for us, it is all about the data. We're a data-driven company; our strategy is built on having hired some of the best statisticians and data scientists in the industry. Everything from real-time credit card authorizations with advanced fraud detection, to our customer servicing touch points, to the marketing campaigns that drive new customers to our products: all of this leverages massive amounts of data and customer history, and it's our job inside Capital One to build systems that leverage it and produce fantastic results for our customers. This is a chart we use a lot when we evaluate the development of a new system: we try to take into account the importance of the data that's coming in, both streaming data landing in the system in real time or near real time, combined with the data we've got in aggregate, and understand that intersection.
And this is where the Spark platform has been really valuable for us: we've been able to combine large historical data sets in both SQL and graph format, query those systems, apply them to the models we build, and execute those models to make scoring decisions. We operate in a really sophisticated environment, with fraud rings that know a lot of different attack vectors. I won't get into all the details of what we see internally at Capital One, but we do see a lot of activity around synthetic IDs, hijacked accounts, stolen IDs, things like that, where fraud rings will come in through one of the different channels I mentioned earlier, try to get some action to take place, and then exploit it through another channel. So the systems we're building try to combine defenses around data sets drawn from those different systems.

Now, we're here in New York City; we've probably got a lot of our competitors in the room, and certainly nearby in the city. As a publicly traded company, we're very sensitive about what we can talk about, so I wanted to give you just a quick flavor of how we build our fraud scoring engine, and then move on quickly to some prototypes. Here's some background; this is what they told me I was allowed to say publicly about how we compute things. So if you work for any of our competitors, I need you to leave now, and the rest of us will talk about some of the prototypes we're building. All right. In truth, we use a lot of different technologies in our big data ecosystem. We're a very innovative company; we like to experiment with different approaches, so you'll see a full range of things here. In addition, we experiment with all the different public cloud providers. So let me talk a little bit about one prototype we've built in Amazon specifically, though we've actually done this with other cloud providers as well.
We leveraged some of the work Databricks has done; they've written a great blog post, and you can see the URL posted down there. We've taken data from our data centers and piped it into Amazon, into Redshift specifically, using the parallel copy capabilities you can run through S3 into Redshift. You see that there. We then take that data and, using the spark-redshift package from Databricks, parallel copy it back out and run it through a Spark cluster; you can see the transformations taking place to produce the DataFrames. We then create both tabular and graph views of that data, apply our models and machine learning algorithms to those views, and run them for fraud detection. What we're looking for here is those hijacked accounts and synthetic IDs. We're looking at nearest neighbors for people who are applying for credit cards. Have they applied before? Are they actually existing customers looking for a new or upgraded credit card? Are they people who have applied before and been declined for one reason or another? Are they applying because they're part of a ring, and is there something about their application, whether it's clickstream data or information we collect through the application process, that tips us off that they're somehow connected to someone else who might be fraudulent?

Let me quickly share a few highlights from our Spark development journey. We're prototyping new and sophisticated models; in fact, we're able to run models in parallel in the cluster, score things, and see different results from different scoring engines. Having tabular and graph representations of the data simultaneously is an important feature. And we're doing things like auto-deploying, which is pretty common, using Chef into AWS.
We're writing code internally, some of which we'll open source. Some of what we're doing internally is auto scaling: our clusters currently scale based on CPU utilization, and we're trending more toward scaling up based on outstanding tasks as the task load increases. Another thing we've done to externalize the data pipeline workflow is develop our own mini language, a DSL if you will. I'll show you an example of it here: it's a JSON file that describes the workflow, so we externalize the steps you need to follow. We can have data sets that are pulled in; in this example it's just a sample workflow where we pull data in from an S3 bucket, run a filter on it to select customers, say for a marketing campaign, who have over $100,000 in annual income, and then write out the results for our marketing teams to use in their campaigns.

Capital One is quickly becoming a world-class software company; we model ourselves after many of the great software companies out there. One of the things we've adopted in the last few years is a full agile strategy for how we build software, and this goes beyond just our software engineers: we are in partnership with our business, and the business defines the priorities for the applications we build in an agile manner. We're getting our teams to be multi-skilled, so you see here we've got full stack development teams, where for any project we pull engineers from different tiers of the application stack and bring them in as part of an overall team. One of the things we have to worry about a lot, especially as we look at various public cloud providers, is data encryption and data security around PII and PCI data.
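A workflow DSL like the one described above can be interpreted with very little code. The sketch below is hypothetical (the field names, step names, and the in-memory stand-ins for S3 buckets are all invented; Capital One's actual DSL will differ), but it shows the externalized-pipeline idea: load a data set, filter to customers with over $100,000 in annual income, and write the result out for the marketing team.

```python
# Minimal interpreter for a hypothetical JSON workflow DSL.
import json

workflow = json.loads("""
{
  "steps": [
    {"op": "load",   "source": "customers"},
    {"op": "filter", "field": "annual_income", "gt": 100000},
    {"op": "write",  "sink": "marketing_output"}
  ]
}
""")

# In-memory stand-ins for the S3 input bucket and output location.
sources = {"customers": [
    {"name": "A", "annual_income": 150000},
    {"name": "B", "annual_income": 60000},
]}
sinks = {}

def run(workflow):
    data = []
    for step in workflow["steps"]:
        if step["op"] == "load":
            data = list(sources[step["source"]])        # pull the data set in
        elif step["op"] == "filter":
            data = [r for r in data if r[step["field"]] > step["gt"]]
        elif step["op"] == "write":
            sinks[step["sink"]] = data                  # write out the results
    return data

result = run(workflow)
# sinks["marketing_output"] now holds only customers with income over $100,000
```

The appeal of this design is that the pipeline definition is data, not code: the marketing team can change the filter threshold in the JSON file without redeploying anything.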
While at the same time encrypting or tokenizing that information, we try to preserve the analytic value that's important for our analysts and obviously for the fraud models that we run. So we've got world-class talent, and we continue to hire. We hired thousands of engineers and data scientists this past year, and we'll continue to do that. This is kind of the prototypical model of what we're looking for. We're trying to take our existing statisticians, who are very strong in SQL and have some general programming experience, educate them on the Spark environment, educate them in Python, Scala, and Java, and get everybody in the company thinking about distributed computing: what it means to process both structured and unstructured data, being able to program in multiple languages, and clearly understanding the importance of algorithms and machine learning. So it's a short talk today, but I wanted to thank Databricks, especially for working with us on training, for the work they're doing around the Spark community, and for the blog post they published that we leveraged for our prototype. So that's it. Thank you very much. Thanks very much, Chris. Our next speaker is from what was recently Razorsight but is now Synchronoss, and he is the Senior Director of Big Data Platforms. Please welcome with me to the stage, Suren Nathan. Good morning, everyone. It's great to be here at Spark Summit, and I'm excited to have this opportunity to talk to you today. Before I start, I just want to do a quick audience poll. How many of you are currently in, or have been in, the trenches doing data munging, wrangling, cleansing, and all of that good stuff? Okay, so hopefully this talk is for you. I can assure you, when this journey began I had a full head of hair. But then again, correlation is not causation. All right, who am I?
I'm the Senior Director of Big Data Platforms and Frameworks at Synchronoss. I was with Razorsight, which was acquired by Synchronoss late last year, and I've been in this space for a long time. My goal is to solve real business problems with the latest technology. That's what everybody wants to do, but let me say it anyway. So how many of you have heard of Synchronoss? Okay, Synchronoss is a publicly traded company, headquartered about 45 minutes away in Jersey, and we offer personal cloud and activation platforms for large enterprises and communications providers around the globe. So what does that mean? If you use a mobile device, any mobile device, chances are Synchronoss is actually working behind the scenes for you. Whether it's activating your device on the network, migrating content, synchronizing contacts, or setting up a personal cloud environment for you to move content back and forth, your pictures, your videos, Synchronoss's platform and software is enabling all of that. So our solutions help operators connect to their customers, whether you're onboarding a customer in the form of a device, allowing them to synchronize data back and forth from the device to the cloud to other devices, or driving a connected car. The latest cars have 4G in them, and Synchronoss activates the connected cars. If you have a connected home, Synchronoss probably activates the connected home. So that is the ecosystem we are in. And Razorsight used to offer predictive analytics solutions to the communications vertical, so the marriage is all about applying our platform, products, and models to the solutions that Synchronoss offers. We are part of the Synchronoss analytics group. Just to give you a sample of what big data at Synchronoss means, here is one large operator that has deployed our personal cloud solution.
We're talking about 30 million active subscribers on the app, about 8 million daily active users. They are uploading anywhere from tens to hundreds of millions of pictures every day, so the data size is staggering. We have deployed the solution across five data centers, running on multiple clusters and all the good stuff. So this is truly big data: all those events coming from those devices can be used to improve the customer experience, whether it's looking at better application functionality, crash analysis, predicting failures, or rolling out applications and release versions. That's what the data can be used for. So what does my team do? We are responsible for the big data platforms and frameworks that are used to generate those analytics consistently. The platform is deployed both on a private cloud and on public cloud AWS infrastructure. And when we talk about analytics here, we mean the full range: it starts with traditional descriptive analytics, the BI world, and goes all the way to advanced predictive analytics. Both ends of the spectrum are there, and we have internal users and customer users who consume the insights generated from the data. You should be very familiar with this: in order to make any meaningful use of the data, it has to be processed, right from ingestion all the way through profiling, parsing, transforming, enriching, and aggregation, down to the downstream processes which may visualize it or apply models on top of it. This is what we mean by the data pipeline process. I'm going to walk you through what we've gone through and what this means. It's not simple, and people struggle with it. This is where we spend most of our lives. Data is not necessarily clean. Data is not necessarily structured, or it's semi-structured. Definitions are missing, there are legacy systems, there's all sorts of things happening in here.
So our journey started with version one, back in the day, and you folks should be very familiar with this. This was the era of multiple ETL jobs running outside the context of the data. Storage and processing were separated, and things ran in long-running batches. Whenever we encountered large volumes of data, latency increased. There was no support for unstructured data. Historically speaking, these sorts of solutions took a year to put in place, and they were pretty expensive and inflexible, with large teams working across them. We could not store large amounts of data online because of obvious restrictions, so that was life back then. Then we entered the appliance world: okay, we'll put the storage and processing together in one verticalized appliance. It was great. Performance improved and latency went down, but cost increased. We still had to work in batches, although lower-latency batches. We still couldn't support unstructured data. And the costs were so prohibitive that we couldn't store the data there; we had to do just-in-time processing, maybe store a limited amount of data and move the rest out somewhere else. It didn't work out, so we moved on to the next version. Then the whole Hadoop thing came about, and we looked at that, but we saw a big skills gap: we'd have to take a bunch of people familiar with certain technologies and migrate them onto it. We said, let's take a pause and see where this is headed before jumping in. Then we saw MapReduce, Pig, Hive, and a whole bunch of other acronyms, so we said, let's take a pause on that. We didn't want to do a technology migration for the heck of it, and we realized the benefits would not be there immediately. Then, a couple of years ago, mercifully, out came Spark. Spark held a promise: it had everything required for pipelining, it had streaming, it had batch, it had SQL access, and it had rich in-memory storage, so performance was better.
So we said, let's take a look at that and centralize our pipeline process on this platform. This is what we call our V4 data pipeline. ETL runs closer to the data, and we can process streams or batches, with superior performance compared to MapReduce and other options. What we did this time is we didn't want to open it up to every single developer or app developer out there, so we abstracted it and built a framework. We said, for everything needed in data pipeline processing, let's build components and expose them to the app developers for them to hook up into a data pipeline process. It simplified the design, it significantly reduced the time for us to roll out a solution, and it was highly flexible for us too. So with that, I'm going to go into data profiling. This is interesting. In the good old days, we had volumes of data, and when you wanted to do profiling, especially for the modelers, they would say: take a sample, profile the data in the sample, and use that to build the model. Then in the big data world, the conventional wisdom became: no, use the full population. Don't use a sample. Train your model on the full population. Then there were others who said, put everything in the lake and somehow everything will work out. How does that work? We still have to go through what the data is; we still need to understand its construct. So why do we still need data profiling? We need to understand what is in those data sets. We need to understand the metrics. We need to understand the risks associated with creating rules. When you want to create an analytic data set, oftentimes you have to stitch the data together to create the analytic record, which can then be used by the modelers. So when you stitch it, how do you generate the right rules? How do you make sure the quality of the data is good?
Can we identify the metadata from the data set, so that we can create the configurations used in the pipeline process instead of manually hooking it together? How do we understand the challenges and inconsistencies in the data ahead of time? Anything you find later in the cycle is always more expensive and tougher to fix. Also, another category of solutions came along where you want to do ad hoc full-text search. In order to do that, you need to tag the data. How can you tag and categorize the data without profiling it? So profiling became a key aspect of that as well. When we looked at this challenge, this is where most of the time was spent. If you broke down the project lifecycle, munging, wrangling, whatever term you want to use, that's where most of the time went. And if you really broke it down, there were a lot of touch points. Data was moved from the ingestion location to some other location, and there were many policies and security concerns, so moving data here and there was not possible. As a result of all that, the interest in any project comes down to how soon it can be delivered; it takes months, multiple months, and the opportunity is typically lost. So we wanted to address this particular challenge. The typical scenario, and I'm not saying everybody has this, but I have seen it many a time: business analysts want to get a bunch of data into Excel or somewhere else and look at it. With big data, you cannot do that. Okay, so we'll put it in a database and run our profiler, but you cannot put it in a database unless you know what the data is, what the schema is, what the structure is. We get data from customers and they say, this is what it is, but it is nowhere close to what it is. So we spent cycle time figuring out what exactly they sent us, but then you couldn't move data back and forth. That's the fundamental problem.
You could not move it from a data lake location into a database or back to some other store, so all of those dependencies were causing a huge headache. What did we need? We needed speed, agility, and automation. How do we automate this thing? How do we put the power back with the business analyst or data analyst? So we set out with these minimum data profiler requirements. We said all data is going to reside in the data lake, so you should be able to profile the data in the data lake. You should be able to review and validate the data. You should be able to review the statistics of the data. You should be able to use those same results to create metadata to run your data pipeline processing. You should also be able to create downstream schemas: if you're going to load this data into an index or into a downstream database, you should be able to create the schemas automatically. These were the goals we set out to achieve. And Spark came to the rescue. We have large data sets with multiple data objects; we can move them into an RDD, split them up by field, and run all sorts of metrics using Spark's built-in transformations. It was very nice, and performance was great. So how does this work? The simple flow is: we have a very usable web application. The user points it to a data lake location, picks up a set of files based on masks, the full set of data objects, and launches the Spark application, which runs in the background, profiles the entire data set, and publishes the results to a repository that is viewable in the web application. Pretty simple. What it generates is a bunch of univariate statistics, whether it's a numeric field or a non-numeric field. There's a whole bunch of things the data scientists need: how many nulls are there? Can we create imputation rules? What's the health of the various attributes, or the histogram?
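The per-field univariate profiling described here can be sketched in plain Python; a real implementation would run these aggregations as Spark transformations over an RDD, but the logic is the same. Function names and the exact statistics chosen are assumptions for illustration, not the actual framework API:

```python
import statistics
from collections import Counter

# Sketch of univariate profiling: null counts, basic moments for
# numeric fields, value histograms for non-numeric fields.

def profile_field(values):
    present = [v for v in values if v is not None]
    prof = {"count": len(values), "nulls": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        prof.update(
            mean=statistics.mean(present),
            median=statistics.median(present),
            min=min(present),
            max=max(present),
        )
    else:
        prof["histogram"] = dict(Counter(present))
    return prof

rows = [
    {"age": 34, "state": "NJ"},
    {"age": None, "state": "NJ"},
    {"age": 52, "state": "CA"},
]
# Pivot row-oriented records into per-field value lists, then profile each.
fields = {k: [r[k] for r in rows] for k in rows[0]}
report = {name: profile_field(vals) for name, vals in fields.items()}
print(report["age"]["nulls"], report["state"]["histogram"])  # null count, value counts
```

The null counts feed imputation rules, and the histograms feed the color-coded field-health view mentioned next.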
Kurtosis, mean, median, and all that good stuff comes out, which can be used. This can be for an individual data set or for any data in the data lake. It could be merged data, stitched data, enriched data. At the end of the day, these things are important before you can start the modeling process. So this is a sample screenshot. It's a simple Angular web app. You can go in there and pull up the results of a particular data set profile. It gives you a color-coded health rating for a data field, green, orange, red, however you want to set the thresholds, and presents all the statistics about the data field, whether it's the number of nulls, histograms, box charts, things like that, in language that is very usable for the data analyst, business analyst, or data scientist. It also generates, as I mentioned, full-fledged JSON metadata. When the profiler runs, it looks at all the fields; it not only infers the data type and generates the content statistics, it also generates the JSON metadata, which is then used by the data pipeline workflow. So if you want to operate on that data set to transform it and enrich it, you can use this metadata to drive that. It also generates schemas, downstream DDLs, automatically from the profile output, so the user doesn't have to go in and create all of that. Some data sets are very large, and there are 40 or 50 of them; you can see how much time can be saved by just profiling them and creating the DDL. The advantages: all the source data is already in the data lake, dumped into the data lake at its designated location, and all the profiling can be done in the data lake. There's no need to move the data back and forth. You can profile the entire data set; you don't have to work with a sample. You can integrate the results into a metadata configuration or a downstream DDL. All of this saves a tremendous amount of time.
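Deriving a downstream DDL from profiled field types can be sketched like this. The type mapping and SQL dialect here are simplified assumptions for illustration, not the profiler's real output format:

```python
# Sketch: infer SQL column types from sampled values, then emit a DDL.

def infer_sql_type(values):
    """Map profiled values to a coarse SQL type (simplified assumption)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, int) for v in present):
        return "BIGINT"
    if all(isinstance(v, (int, float)) for v in present):
        return "DOUBLE"
    return "VARCHAR(255)"

def generate_ddl(table, columns):
    """Emit a CREATE TABLE statement from {column_name: values}."""
    cols = ",\n  ".join(f"{name} {infer_sql_type(vals)}" for name, vals in columns.items())
    return f"CREATE TABLE {table} (\n  {cols}\n);"

columns = {
    "subscriber_id": [101, 102, None],      # all ints -> BIGINT
    "upload_mb": [12.5, 3.0, 7.25],         # floats -> DOUBLE
    "device": ["ios", "android", "ios"],    # strings -> VARCHAR
}
print(generate_ddl("uploads", columns))
```

With 40 or 50 large data sets, generating these statements from the profile instead of hand-writing them is where the time savings come from.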
It might sound trivial, but for those of us who do this for a living, it's a lot of time. The objective is to send cleaner data down to the modelers, because at the end of the day, if you want to generate rules, if you want to generate enrichments, the data pipeline process can be built accurately to cater to the needs of the data scientists downstream. So we've seen significant improvement as a result of this approach. What used to take weeks, sometimes days, is now cut down to hours. The overall data pipeline process has been reduced by 80%, I would say, which is why we say that from the time we receive the data, we can put out full-fledged metrics, in the form of, let's say, dashboards and descriptive insights, in under a month. We have identified data quality issues that would trip us up ahead of time, and empowered the business analysts as well. Anyway, I want to quickly go through this. Profiling is just one component of our pipeline process, the first part, but when we built the stack from the ground up with Spark, we said, you know what, we need a multi-layer architecture, each layer logically performing a particular function, right from ingestion to data storage to data processing to modeling to integration to consumption. It's a pretty layered infrastructure, and this is the architecture we have in place today. We just talked about the data management layer. The framework components are all Spark components, at least in the profiling, parsing, transformation, and integration layers. Each component has a set of functions, and these components can be hooked up in a simple Oozie workflow, completely configurable through metadata. So the building blocks are available for the app developers; they don't have to sit down and write all those transformations. In fact, in profiling and parsing, we have our own scripting engine integrated, so it's very easy to transform data, right?
Some cleansing rules, lookups, substitutions, imputations, all of that is very easy to do with this sort of framework approach. If you look at the architecture, we have the data lake and we have the orchestration layer, which is Oozie, and then all the green boxes there are components in the pipeline, whether it's a SQL engine, a data prep engine, a database loader using Sqoop, or a partitioner. The whole thing is built in a component fashion. And if you look at the stack itself, we have certain software in there: we've used Elasticsearch for index storage, for quick retrieval and ad hoc analysis; we have the data lake, which is a MapR distribution; we use Spark extensively in the data processing arena; and we have our own AngularJS visualization layer. What's next? We continue to expand our component set and move more into the value aspect of it: bivariate analysis, multicollinearity, all of those things that are typically done on the data. We want to componentize that and string it together in the data pipeline after the univariate step, and then the variable creation, the analytic data set creation. I'm going to zip through here. So the lessons learned: let the business drive technology adoption. There are a lot of hidden costs. Plan incremental updates and deliver something to the business periodically. Simplify the whole thing. Framework-based development is very, very helpful for speeding up delivery and also reducing our overall cost. At the end of the day, what our customers need is what's on the right. When a customer calls into a contact center, they want to know the lifetime value, the churn risk, the profitability. That's the kind of information they want. All the stuff on the left, that's the big data stuff, right? So we are all about delivering what's on the right: the customer using the data insights to better their business. With that, I will end my talk. Thanks for the opportunity.