Thank you all for being here tonight. I'm going to keep it fairly brief. My voice is not doing so well today, so I don't want to wear it out entirely. I also figure you've probably heard most of what I have to say already. I don't know that I have that much fresh for you, but maybe it's fun to see the old guy talk still.
So we're in the middle of, or I'd actually argue still in the early days of, a pretty phenomenal transition on all kinds of levels. The technology platform that we're now deploying, and that's starting to become pretty standard across a lot of industries, is constructed totally differently from what we were doing ten years ago. It's good to understand the differences and how this new world functions in order to operate in it, which I assume you're all doing. So you're probably already aware of how this works, but I thought it'd be worth going through some of the underpinnings.
Ten years ago, mostly folks were using just relational databases for enterprise data. There really wasn't a lot of alternative to that. They were running on very expensive, exotic hardware that was only used to run that sort of enterprise database software, and they were paying a lot for the software. It was all proprietary software with largely proprietary APIs. SQL is a standard, but it's not necessarily implemented uniformly across vendors. All that's changed.
I think the most fundamental change is in the creation of the software. I first learned about the power of open source as an alternate way of creating software really in 2000. I'd experimented a little in open source before that, but in 2000 I had written something called Lucene, a text search library, and put that up. It probably wasn't the best text search technology in the world.
It certainly wasn't the most complete. But it was open source, and it started to gather momentum, and slowly over time it's become probably the most successful text search software, predominantly not for technical reasons but for social reasons: because of this open source community that's grown around it, of people contributing, people evaluating it at no risk and adopting it at no cost in many cases. And even when they are paying a vendor, they're paying a vendor for services rendered. They're not paying a vendor just for another copy of something which the vendor is not putting any more work into. So it really highlighted to me how this different way of producing and delivering software could make software succeed, that it was an accelerant to software.
A few years later I was working on trying to build a web search engine, and we were trying to scale it to multiple machines, doing it at low cost on commodity hardware. And Google publishes papers. You all know the story. We started re-implementing Google's ideas: this notion of a distributed file system that they had, GFS, and a MapReduce engine on top of it. We called it Hadoop, after my son's stuffed toy that he named.
The idea there really was not that original. It was combining two concepts. One was what Google had developed, and I was in a position to realize the value of that, because I was trying to solve a very similar problem of building a large distributed system and struggling with it, without a general-purpose library to help me, trying to reinvent this reliability for each thing you do. And at the same time I was aware of the power of open source to strengthen technology and to spread it. So it seemed obvious to me that these ideas of Google's should be developed as open source.
It wasn't, I don't think, any brilliant inspiration. It was a one-plus-one sort of thing. Any of you would have done the same had you been in my shoes at that point. What's happened since, though, has really been phenomenal. I do have big shoes, it's true. You'd all fit in my shoes one at a time, probably, but not at the same time.
So what's happened since, though, is that I think we've seen something even stronger develop, built around open source, another phenomenon, which is this ecosystem. We had this one project with multiple components, initially just HDFS and MapReduce; YARN got added a few years later. And it was its own thing, but people didn't use it on its own much. They tended to add some other libraries on top, which were created as separate projects, like Pig and Hive from Facebook. Over time more and more of these projects got added, adding more and more functionality, each of them independently governed, until we're at the point we are today, where we've got 30 or more projects providing a really wide range of functionality, all, again, independently governed. This system of growth of technology without any central control, through these different communities, I think is really the lasting legacy.
Any part can be replaced over time, and we're beginning to see that with Hadoop. MapReduce is gradually being replaced by Spark to a large degree. Most of the things you could do in MapReduce you can do better in Spark. It's a next-generation solution, and it can gradually replace it. There are other storage systems besides HDFS which may someday replace HDFS. HDFS seems to be going pretty well so far, so I'm not going to predict its death anytime soon. But overall we're seeing this evolution.
We've got different institutions that are providing these new systems, which are sort of the mutations in the evolutionary model. Some of them succeed wildly, like Spark, like Kafka and others, and become new standards, new common features of the ecosystem that everyone builds on. I think this is the world we're now living in.
It's an exciting time. It's a little frightening, because change is difficult. I think in previous generations people could learn a set of technologies and use them through their career. That's no longer the case. We're all needing to learn new technologies every few years. That's a cost. That's the price. The benefit is that these technologies are not just created on a whim. Well, maybe they are initially created that way, but they're not adopted, they don't succeed, whimsically. They succeed, excuse me, because they're providing real value, because they're doing something that you couldn't do before, or couldn't do very well or very easily before. So we're being provided with more and more powerful tools, and we're able to do more and more things and handle more and more data.
That takes us back to the fundamental trend that's driving all this. The reason why we needed this open source is the proliferation of data throughout industries. You know Marc Andreessen's famous line about software eating the world. I think it's really about data. Every industry is adopting technology. We have technology throughout our lives now: on our wrists, in our pockets, in our cars. You name it, it's got a processor today, and if it doesn't, it's likely to have one soon. These are all generating data, and this data can help us understand how our world is operating, how our businesses are operating, if we choose to
examine it. So the more we have technology to help us understand it, the more we can improve our businesses and our lives, optimize things, become more productive, more efficient. It's generally a very positive force, this use of data that we see in so many industries that I had no expectation I would ever touch in my career. I thought, working on search engines and data technologies, I'd be constrained to the web, which was a pretty huge success in my mind of technology touching a lot of the world. But now it's so much wider than that. You probably work for a lot of these industries: not just banks and insurance companies, but healthcare, tractors, airplanes, cars, entertainment, you name it. Every industry is generating data and benefiting from the successful analysis of it.
There are risks, for sure. Data can be abused. We as an industry need to spend more time thinking not just about what's possible, but about what's good, what's healthy for a society, what's ethical.
And that needs to be a core component of every decision. We're getting better about that. You're hearing more about that. A part of every data science course should be ethics, I hope. How many people here have studied data ethics along with their data science? Ouch. Well, now you have a new project to work on in addition to learning new technologies.
So anyway, it's a pretty phenomenal change, and I think it's the change of our century, really, this use of data to drive progress and productivity across industries. We're at the early stages. Think about the places where you work and the projects you do: it's a very small portion of the data that's generated that we're really using to its potential. Even if the technology platform were to stand still today, it would take us a decade or more to fully take advantage of the existing technologies, given the existing data sources that we have, and really achieve their potential. And the platform isn't standing still. We will have new projects, new things coming along at a good pace, hopefully at a manageable pace. I think there's a natural limiting factor: the degree to which people can really learn and adopt new technologies is the degree to which they can become standard parts of a platform. By nature, we can't get too far ahead of ourselves.
So it's an exciting time. Lots of new stuff. These days people always want to know what's the next big thing, and I don't know. Nobody knows, right? That's the cool thing about it.
It's unpredictable. We as a crowd know once we decide, and we haven't decided what the next big thing is in technology. It's clear there are some exciting things in the near term. There are a lot of advances in hardware. We're fairly certain that memory is going to become a lot more plentiful in the next few years; all the hardware vendors are telling us that. We're seeing tremendous advances in machine learning: AI, deep learning sorts of methods. It's unbelievable to me the degree to which speech recognition and image recognition have improved in the past few years, and I think that sort of technology carries over to lots of other industries, to recognition tasks: recognizing marketing opportunities, recognizing fraud, recognizing cyber violations, invasions, you name it. I think we can use machine learning in a lot of cases. We're just beginning to scratch the surface of the use cases there, and our understanding of the technology is still young. So that's an exciting one to watch. But there are lots of other things that will come that we can't imagine, and it'll be exciting to see those in the coming years.
So my voice is fading. I've yelled at you probably long enough. I'd like to hear some of your thoughts and questions. The community really is in charge here and really does drive this world of software now. So let's hear where you'd like to drive it. Yes, go.
Everybody says that. Everybody says they're in the backwater. Everybody says there's nobody in our region who's qualified. And I don't think that's true.
I really don't. I really think everybody reads the same news these days. Everybody's connected to the same open-source projects. Everybody feels like they're in a backwater, and that's, I think, just the fact that things are changing so quickly.
The question is, what do you see as the next big things that are on the horizon?
I just told you, I don't know. Next week? Next month? So, I mean, at this conference we saw... How many people here use Flink? I mean, the challenge for Flink, frankly, is that Spark has such tremendous mindshare that Flink has to provide something that's sufficiently better. Spark's got tremendous numbers of libraries that are growing. It's got a lot of activity, and it does a lot of things that Flink doesn't yet do. It does have latency issues for stream processing. You're going to have trouble with Spark getting sub-second latency; you're going to have a few seconds of latency, whereas with Flink you can get down into some number of milliseconds. On the other hand, how many people need that? I think we've certainly seen at Cloudera that, when faced with that choice, most people are happy to say, a couple of seconds is fine, I really need this machine learning algorithm that's implemented in Spark.
I'm not going to pick any new hot thing. There are a lot of things out there that could be the next thing, but it's really not for me to decide. It's for the community. I've never been much of a picker of those kinds of things. Sorry.
Back here, you were early to raise your hand.
Yeah, I mean, that's fair, right?
I think if you make something open source under a license that permits people to use it however they want, and then they use it however they want, which includes selling it and making money off of it, and you get mad at them, then you're insane. They're doing exactly what you permitted them to do. So if you're creating open source, you need to be comfortable with other people benefiting from it. And if you're not, don't release it as open source, or put some sort of clause in the license that prevents the things that you don't want to happen with it. So I don't have a lot of sympathy for that. I don't think people owe the community giving back. People should contribute because it's useful to them. They get their patches fixed upstream. There are lots of motivations, lots of reasons to contribute besides guilt, and I don't think we need that one. Does that make sense? Yeah.
A chief ethical officer seems like a reasonable idea. I mean, certainly some sort of ethical review board. But it also needs to be not just an encapsulated function. I think it needs to be throughout the design process. You need to be thinking, is this something that people would trust? And I think the core of trust is: are you doing things that people would reasonably expect you to do, or are you doing something that's going to surprise them, and they're going to go, what, you did that with my data? And it doesn't matter so much what your license agreement says. People don't read that.
What matters is their expectations. So that's the first thing to understand: what do people expect you to do, and what do people expect is reasonable? If you're using a navigation program, people might think it's reasonable to use your location to identify traffic congestion. But to use your location for other purposes sort of seems off limits, and it doesn't matter what the license agreement says for your location data. And I think that goes across the board. So it's a subtle thing. There are people who've done a lot of thinking about it. In the healthcare industry they've been working on these issues for a long time: how you still manage to aggregate data and do research and at the same time respect people's privacy. That's a very similar problem to a lot of big data issues.
Yeah, I mean, I don't think it's a matter of being agnostic. It's a matter of caring about your future. If you offend your users, you're going to lose them, and you might lose them to the degree that your industry becomes outlawed, if they get angry enough. So I think it's something people need to take very seriously. There can be real backlash.
We haven't really formed clear legal policies about data protection in most countries, most jurisdictions. That's still happening. And the more that institutions offend people's sensibilities, the stricter the legal climate is going to be. So it's up to us to make sure that we don't offend people, that we do respect them, that we do build their trust. And that's also going to keep them coming back to your business. I hear a lot of people who say they don't want to use the same vendor for their email as their calendar, because then the company will know too much about them. That's a failure. If you don't trust the company that's storing either your email or your calendar with that data, and not to abuse it, that's a problem. Those vendors are not doing a good job of building your trust that they're doing something reasonable. Anyway, you could go on. There's a lot of paranoia you hear from people out there, the practices people have, and I think all of those are indications of failures of trust.
Ted?
And I think that what we have is a responsibility to our customers to provide tools that satisfy the needs they have, which then are to behave ethically. If we don't provide tools that allow people to do that, then our customers will fail and we will fail. But there will be market pressures on us if there are ethical pressures on our customers.
I know. So, I mean, the selfish vendor perspective is: if we want this industry to grow so that our market can grow as a vendor, then we need it to grow responsibly, or else it's going to be limited by its own failures. So that's sort of the Cloudera perspective on that. I mean, we're in some ways the arms dealers here, and we want to sell to the good states. I don't know.
That's maybe not a good metaphor, but, yeah. Distributions don't kill people. Something like that.
So what would be an impactful use case that would have surprised Doug Cutting? Somebody came to you and said, I'm using it for this, and you're like...
I mean, the healthcare ones I always find pretty wonderful. We did a thing with Cerner, where they're gathering data from hospitals and using it to better predict which patients might get sepsis, which can be a fatal disease, and then identify those people and say, treat this person especially, because they have a high likelihood of getting sepsis. By their statistics, they've saved hundreds or more lives through the hospitals, because they provide this as a free part of their data service. They provide data services to hospitals throughout the U.S. I don't know if it's in the world. I don't know how big Cerner is. So that's a neat one.
The genomics stuff is fascinating, and I think it's, again, at an early stage. We're going to see decades of progress there. I was at a talk recently at UC Santa Cruz in California, and they are getting the sequences of tumors in kids that have been very resistant to any standard treatment. They do this cool thing where they look at the sequence of normal cells in the child and the sequence of the tumor, and then they diff those and find out what the mutation was. Then they search a database that they're trying to build of mutations and find out which treatments were effective against that mutation. It turns out that knowing the mutations is, for one thing, a much better way to classify tumors than by what organ they're in or what they look like, in terms of classifying them for treatment, what treatments are effective. And then they can also start to understand, and I'm not enough of a geneticist or a biologist, the pathways that these genes take in being expressed and going from, I guess, who knows.
It goes from DNA to RNA to protein, I think, is the sequence, and there's sort of a network that each gene has that's unique. So once you understand which ones are in play in a given tumor, even if it's a unique mutation, if you can plot the pathway, which again is data, it's just sequencing these various things, then you can try to find a drug that blocks a particular step in that process and kill the tumor that way. And they're having a lot of success. There's still a lot they don't understand, but it's pretty exciting to see that they're able to take cancer, which is sort of the grand problem of healthcare that's been around, that's really been difficult to get a hold on, and they're kind of getting at the root cause of it and treating it as a data problem. I think we're going to see a lot more of that. So those ones are always very exciting.
I also love the tractors and stuff like that: these big Caterpillar trucks that have hundreds of sensors on them streaming data back to Peoria, Illinois, and they're doing predictive maintenance on that. Airbus is doing the same stuff with jets, and Tesla's doing that with cars. It's pretty neat to see that we're improving what I think of as not traditionally high-tech industries with high tech. That's kind of fun to see.
Okay, did I distribute my questions fairly around the room? I don't want to... Mm-hmm.
The question is, do I think the same thing can happen with open data as with open source?
It's a little trickier with data, especially since the data sets that I think are most interesting are things which you really don't want to be open, that have personal information about people. I think open data is awesome. I think most government data sets should be open: bidding processes, a lot of things which it's definitely in society's interest to know, which governments haven't traditionally published and which can be very easily published. So I think open data is a wonderful thing. But there's also a lot of great data that you can't just publish freely, that you have to anonymize somehow or share under some legal agreement. There are a lot of mechanisms, and we need more and improved mechanisms, to share data. But again, in industry after industry, we care the most about data when it touches people. It has the most value then, but it also has the most potential to do damage to those people's lives. So we need to control it, and I think that's a tough one for open data.
Yeah, no, this is interesting. This database at UC Santa Cruz that they're building, they have a global public database of mutations, and it's indexed by mutation. That's the only thing that's published. So you can't identify the patient in any way; you can only identify the mutation that the tumor had. That's an interesting case where they can have, I guess you'd call it an open data set, that's globally available but which is completely anonymized about individuals. So when that's possible, that's a wonderful thing to be able to do.
So I guess that's it. I think I'd rather get some sleep tonight. Thank you all very much.