So, we're talking about mining public data sets using open source tools, but first, a few words about me. My name is Alex, and I'm from Seoul, South Korea, as you could guess, though originally I'm from St. Petersburg, Russia, where I graduated in mathematics many years ago. I'm a co-organizer of a tech meetup in Seoul, so if you happen to travel there, let me know and we can arrange something like this event there. I'm also a committer and PPMC member of the Apache Zeppelin project, which I'll be mentioning later on, so please feel free to contact me. I will put the slides online, so you don't really need to take pictures; you can just get the links from there.

I'm a software engineer by training, and there are all types of software engineering out there, but today we're going to talk about one particular type: data engineering.

First, I don't really think I need to persuade you that data is important these days. It is the product or by-product of many different things in our lives, and a lot of big, interesting, and successful projects and companies were built purely on data; these are just some examples. All those companies, whether they do things in the physical world or on the internet, are really data-generating machines, and they know how to operate those machines very well. The IoT spot is blank there because IoT is one of the subjects of this conference, and I don't know of a clear winner in that area, a successful project built on IoT data, but I'm pretty sure there's going to be another name with another logo there soon. I would love to see it, and hopefully one of you can build it.

A little bit of context: the size of the data is growing, and the amount of public data is constantly growing too. There is more and more data available out there, there's going to be even more in a year or so, and you can't stop that, so you'd better master it now and jump on the train early rather than late. There are going to be more and more data products, meaning anything based on data, be that research, a company, or an actual service, and with that, the number of tools available to crunch the data is also growing. That's why it's sometimes hard to orient yourself in that landscape, and one of the goals of my talk is to share my experience. I spent a little less than two years as a data engineer in a small startup company in Seoul, and I gained some experience that helped me pick the right tool for the right job. I hope I can share that, so that on your next data analytics gig, freelance job, or actual data job, you can use this knowledge.

The main point for me is that public data means open opportunity. It's public, it's out there, it's open, and it's up to you to go and pick it up. If you don't want to, that's fine, but I'm pretty sure there will be more and more people wanting to exercise that opportunity. Three things are important on that way: what data sets are available, what the tools are, and what the approach is. We'll go briefly through all of that, starting with data sets, and there really are a lot of open and public data sets out there.
I just listed some here; the intention was not to have a wiki page with all the links, you can easily find that, but there are some I find particularly interesting, and I'm going to talk about two examples.

The first is the GitHub Archive. I'm pretty sure you all know GitHub as a company, and I think they did a great job pushing things further by opening their public user activity logs and putting them out on Google Cloud, so you can just download and use them. It's basically billions of log events: they have about 10 million users, and each action a user takes on their website generates a log event. They don't just monetize that data as other companies do (and of course they do monetize it); they also share it, so it's available, you can try working with it too, and there are definitely opportunities out there. People do exercise those opportunities, and GitHub fosters that: they have hosted an annual data analytics challenge for two or three years in a row now. Here are some examples of projects built using that data. One searches commit log messages for particular sentiments and lists them; people do all types of things, and you can pick some interesting ones and maybe make them a little better. One of the useful ones, whose author happens to be from Seoul, so I know him, analyzes code style conventions. Next time you argue with your colleagues about whether to use tabs or spaces for indentation, or whether to break on this line or that line, you can move on from "well, that's just your opinion" to an actually data-driven approach and say: for this language, this many projects use this convention, so we should use it too. That's valid data for the argument, and I find it useful myself. Another recent one is a kind of live dashboard of GitHub. Of course, GitHub internally has more advanced things like that, but from the outside you can also sneak a peek at what's going on on their platform, which is one of the biggest hosts of open source software out there, so it's quite interesting to watch all the types of events going on, the number of them per second, the pull requests, the comments, and everything. And of course there are opportunities for visualization: this is an example of repository activity by language. It's not contrasty enough on this screen, but there's a link so you can check it out; those are repositories for programming languages like Python, Ruby, and others, showing how active they are by number of commits.
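The talk doesn't show the code for this, but to make it concrete, here's a minimal sketch of crunching one hour of the GitHub Archive with Spark's Scala API. The file name pattern and the per-event "type" field are my assumptions about the data set's layout, so treat this as a starting point, not an official recipe:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object GhArchiveEventTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gh-archive-event-types")
      .master("local[*]") // laptop mode; point at a cluster later
      .getOrCreate()

    // One JSON event per line; Spark reads the .gz file transparently.
    // Hourly dump name assumed from the archive's conventions.
    val events = spark.read.json("2015-01-01-15.json.gz")

    // Count how many events of each kind happened in that hour.
    events.groupBy("type").count().orderBy(desc("count")).show()

    spark.stop()
  }
}
```

The same few lines scale from one hour on a laptop to months of archive on a cluster, which is exactly the property of this tool stack the talk keeps coming back to.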
So that was one example. Another data set I find really fruitful, one I really love, is Common Crawl. Have you ever heard of it? See, that's the problem, and that's one of the goals of this talk: to raise awareness of this awesome stuff lying out there. It's basically a crawl of the internet, a fraction of it of course, but quite a substantial one; it's the same raw material Google is built on. It comes from Factual, a small but solid company in California founded by one of the early Google employees who left and started it. What he did was monetize structured data: whenever you want the list of top schools in the US ranked by state, or anything like that, you go to them and buy it. They sell an API and data for data-driven journalism and so on, so it's kind of a big thing. But they wanted to contribute back to the community, so they started a non-profit foundation called Common Crawl. They crawl the internet and put the results out there publicly every month; it's about two billion URLs per month for the last three years or so, and each month is about 150 terabytes of compressed data. It's all the resources of the websites they crawl, saved in the WARC format, which is the standard for web archives: the same format the Internet Archive uses to build their Wayback Machine and all the other goodies they make. So they play nice in the ecosystem, and that's very fruitful; there are many things you can do with it.

One example project built on top of it that I wanted to highlight is by Ilya from California: a search index over the data set. You can search the URLs that have been crawled and figure out whether your favorite website, or a certain type of website, is in the crawl. It's basically a huge B-tree index of the URLs, sitting on Amazon S3, with a nice web API, and it's completely open source: you can play with it, it's Python based, and you can improve it or deploy your own. (I'll sketch a quick query against it at the end of this section.)

And it's not only an industry project; it's an academic one as well. Every year the number of academic papers based on this data set grows. One example uses the data set to extract natural language and build language models for translation and other natural language processing tasks: they convert the crawl into an n-gram model and publish it. Before that, if you wanted to do research in that area, you had to be part of a university that paid Google for access to the trillion-token n-gram data set they collected from the web. Not anymore: an independent individual can just download this one and play with it. And those researchers publish their data set as well, so it's not only the crawl, it's all types of derivatives based on the crawl, like the URL index and n-gram language models for different languages, not just English. It's quite awesome.

Then there's another recent project I found: somebody is actually building a web search engine using that data. It looks a little familiar, and I'm not sure how legal that is, but they just started, so it's a good time to jump on. If you're somehow interested in large-scale search engines and such, you don't need to apply to work at a big company; you can just take the data, play with it on your own, and build something useful. It's totally up to you, and that's awesome; it was never like that before. You don't need to run the crawl yourself, and running a crawl is a hard job, so many companies have been built around just doing that. Now it's open, so you can just go pick it up and use it.
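As promised, here's a rough sketch of querying that URL index over its web API. The collection id (CC-MAIN-2015-40) and the query parameters are my assumptions about the API's conventions, not something the talk specifies, so check the project's documentation before relying on them:

```scala
import scala.io.Source

object CommonCrawlIndexQuery {
  def main(args: Array[String]): Unit = {
    // Assumed endpoint layout: one index per monthly crawl collection.
    val api   = "http://index.commoncrawl.org/CC-MAIN-2015-40-index"
    val query = s"$api?url=commoncrawl.org&output=json"

    // The API streams one JSON record per line, each describing a capture:
    // the URL, a timestamp, and where in the S3 archive the page lives.
    Source.fromURL(query).getLines().take(5).foreach(println)
  }
}
```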
So those are the data sets; now, a few tools we're going to use to crunch that data. There are plenty of tools out there, and I've tried to classify them a little. There are the generic ones, the beloved grep and all kinds of programming languages with libraries you can use; that's all good, but it's usually hard to scale beyond one machine or one type of data. Then there's the high-performance and, for that matter, low-level stuff: hard-to-use tools that do the job, but you'll spend a lot of time learning them or getting access to a cluster of machines that can run them. Recently, though, there has been a plethora of tools I call here "new and scalable," and those are the ones we're going to talk about. I picked some I find really useful, and almost all of them happen to be under the Apache Software Foundation.

At this point I wanted to ask: how many of you know about the Apache Software Foundation? Okay, there are a few hands. If I may push further, what exactly do you know about the foundation? Can anybody share? [Audience: "I've been using Tomcat for a pretty long time."] Right. [Audience: "I've heard about Spark and Kafka."] Great, great, so there's the web stuff. Anything else? [Audience: "Many open source projects are based on Apache."] Right, and that's actually much better than the answers I usually get when I ask this question.

I wanted to highlight a few things. The main point is that it's not one project: it's not just the Apache web server, and not just web stuff. There are more than 200 projects there, all unified by the business-friendly license they use, and there are about 3,000 people deeply involved in the foundation. The foundation also shares a particular view on how to build software openly and collaboratively. I've spent the last year working on different projects under Apache, and since then I honestly consider it the best way to build software, so I can't encourage you enough to check out the keywords I highlighted; there is something really interesting going on there, and we can definitely learn from it in many ways. Anyhow, after Hadoop, Apache somehow became home to a great deal of data analytics software, so we're going to pick some of it and cover it briefly. These are the tools I wanted to list; we'll go through each of them, and I'll describe a little of my experience using them. They play nicely together and constitute a stack you can use for basically any data analytics gig, they're easy to learn in my experience, and they're generic enough for multiple projects. Don't be mistaken, they're not simple, but the difference from the previous generation is that it's much easier to get started and learn them.

The first one is Spark, and it has a star there because it really is the star project: very popular and growing tremendously, with more than 1,000 contributors all over the world these days. It actually started before 2010, as a research project at UC Berkeley by a few students and a celebrity professor. They're great, very smart people; they built it open source from scratch, and eventually they joined the Apache Foundation, donated the project, and since then it has grown tremendously and gotten a lot of traction. What it does is provide a REPL interface and an API in multiple languages over a new abstraction: a kind of distributed array sitting on a cluster of machines that you can iterate over. Here's a really small example of a Spark program, and this is what I mean by them being easy to get started with: this is the "hello world" of big data, just counting the words in a huge array. And there's much more in there, graph processing libraries, machine learning, SQL, a lot going on, so it's definitely worth learning a little about if you want to be in the data analytics field.
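The slide's exact snippet isn't in the transcript, but the canonical version of that big-data hello world, a word count in Spark's Scala API, looks like this; the input path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))             // lines -> words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // sum the 1s per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

The point of the example is the abstraction: the code reads like operations on a local collection, but Spark distributes each step across however many machines the context is connected to.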
The next one is Apache Zeppelin, the project I'm involved in. It's a GUI-style notebook that plays nicely on top of different backend processing systems, be that Spark or anything else. It's quite easy to set up, it looks like this, and you can build interactive visualizations with it. I find it really helpful for getting started with a new project: building intuition around the data, playing with it, and eventually building a data product on top of it. It's been in development for a while and went through a few phases as internal prototypes; there were closed-source products before it, but Zeppelin itself, like Spark, was open source from scratch. It joined the Apache Foundation about a year ago and has been through three major releases since, so it's getting traction, and it has grown from two contributors to more than 70 all over the world. You can build things like this with simple queries over distributed data sets, which is quite nice, and it has pluggable interpreters, so beyond Spark there are many other things you can plug in; we won't talk about them here, but you get a lot of power there.

The next one is Warcbase. It's a library built by a professor at the University of Waterloo in Canada. It helps you work with crawl and archive data and gives you a really nice API: the example you see counts top-level domains among crawled pages that returned a 400 status. You don't need to do the filtering and the low-level work yourself; you can just use this. So that's a cool project.
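Warcbase's real API handles the WARC parsing for you, so as a hedged stand-in, here's that same computation, counting top-level domains among pages with a 400 status, expressed in plain Spark. The CrawlRecord case class and the inline sample data are hypothetical substitutes for what the library would parse out of a real WARC file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical pre-parsed crawl record; Warcbase hides this plumbing.
case class CrawlRecord(url: String, status: Int)

object TldCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tld-count").setMaster("local[*]"))

    // A tiny in-memory sample standing in for a real WARC load.
    val records = sc.parallelize(Seq(
      CrawlRecord("http://example.com/a", 400),
      CrawlRecord("http://example.org/b", 400),
      CrawlRecord("http://example.com/c", 200)))

    val tldCounts = records
      .filter(_.status == 400)  // keep the 400s, as in the talk's example
      .map(r => new java.net.URL(r.url).getHost.split('.').last) // crude TLD extraction
      .map(tld => (tld, 1))
      .reduceByKey(_ + _)

    tldCounts.collect().foreach(println)
    sc.stop()
  }
}
```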
One more tool is juju, by the folks at Canonical. They call it "service modeling at scale," which is a fancy way of saying a deployment and configuration automation tool, and it plays nicely with this whole ecosystem: it has integrations with everything you'll need to go to scale. That's an example of how you'd get started with it: with about seven lines of shell you get the tool plus a cluster of seven machines with a distributed file system, a resource manager, the whole Spark stack, and Zeppelin on top. It's pretty simple, and if you want more machines, you just re-run the last command, saying 400 or 4,000, and you add machines and scale out.

The approach to using these tools at scale and on a budget looks like this. You start with the Friday-night experiment, prototyping your solution on a single laptop. Then, if you verify your hypothesis and it works well, you run it on a fraction of the data, on maybe hundreds of machines, to estimate the cost, because if you want to go to full scale, to process a crawl of terabytes or eventually petabytes, you'll need many more machines. And if that works out, you go to scale, and the nice thing is that you can still use the same tools, the ones I listed before: you write your software once and it runs in every one of those situations. That's different from how it was before, when you rewrote your software every time you wanted to move to the next level, so I find that a very useful value proposition of this stack. And again, these tools are quite easy to get started with; they were designed for that.

So that's the opinionated stack I wanted to share with you, with some takeaways: there is plenty of data, there are plenty of tools, and there are a lot of open opportunities. It's up to you whether you want to start exploring them, and I want to encourage you: please do. With that said, thank you, and I'll be happy to answer questions.

[Question about Zeppelin versus similar tools.] That's a good question. The question was that there are many other tools in the same area as Zeppelin, GUI tools in the browser for interacting with data. I get that question a lot, and they are comparable, but the focus of IPython and Jupyter, for example, is to be very generic and cover all the cases, and that's fine, while the goal of Zeppelin is to cover large-scale cluster computing cases in particular, with a smooth workflow for them. If you've ever set up a Spark kernel for IPython, it takes some time; it's doable, of course, but Zeppelin just works, it's easy to connect a cluster, and it's simpler to use. I'm pretty sure the area is big enough for multiple projects, and different people will use them: IPython is pushed by the scientific community, while Zeppelin is targeted more at engineers, at the industry community moving toward data science. So I think it's a big enough area, all of these projects are doing well, and I don't think there's direct competition where one kills another; they're going to live on and serve their purposes. Any other questions?

[Audience: "What's your advice for someone who works with data but not as a data scientist? How do I get into this field, and where can I start?"] Great question: if you're coming from an industry background, from more lower-level stuff, how do you get into data analytics? Usually, if you know C++, you more or less know Java, and everything in the stack I've been talking about, except maybe juju, is written in Java or Scala, and Scala is kind of a functional Java. So it's fairly easy to get started with these tools coming from a background like yours. I'd encourage checking out Spark, which has a nice API for Java and Scala, and from there you can follow all types of references; I think that's a good path. [Audience: "That's the tooling itself, but what about the mindset? How do you approach data, and how do I train for that?"] Right, that's actually quite a big question, and I'd be happy to share my thinking maybe after the session, but there are plenty of online courses, and that's what I did myself: I went through many of them. Check them out; there are some really good ones focusing more on the math aspect or more on the engineering aspect, depending on what you want.

[Moderator: "We'll take one last question."] I'll be around, so we can talk after the session. [Audience: "I'm working on an automated system that grades essays. The problem is that I'm training it on my local machine using sparse vectors, and it's running out of memory. What kind of servers should I use? I'm a student."] That's a good question. So the data fits the machine, but the processing does not; then you've got to use more machines and a system that plays nicely with more machines. In my experience, Spark is something you can get started with easily, and it is able to represent sparse data: you can read your data into Spark's internal sparse representation with one or two commands and go from there, and Spark will take care of distributing it across the cluster, which is the best part of it. So maybe you can manually spin up two or three machines with a Spark cluster, try reading the data, and see whether the cluster as a whole has enough memory for the sparse representation. I would do something like that.
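For reference, here's roughly what that looks like in Spark's MLlib: a sparse vector built by hand, plus the "one or two commands" that load a whole sparse data set; the input path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

object SparseVectors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sparse-vectors").setMaster("local[*]"))

    // A million-dimensional vector with only three non-zero entries:
    // Spark stores just the indices and values, not a dense array.
    val v = Vectors.sparse(1000000, Array(3, 512, 999999), Array(1.0, 2.5, 0.3))
    println(v)

    // Labeled data in LIBSVM format is read straight into sparse vectors,
    // partitioned across whatever cluster sc is connected to.
    val data = MLUtils.loadLibSVMFile(sc, "features.libsvm") // hypothetical path
    println(data.count())

    sc.stop()
  }
}
```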
[Moderator: "Thank you so much, Alexander. For those who want to follow your tutorials or get more information about what you just presented, how can they get more from you?"] I will definitely publish the slides, and I generally hang around the Zeppelin mailing list, so you can stop by and say hi; I'm usually there and happy to answer questions. You can also contact me on GitHub or Twitter, and I'll be around after this talk, so we can talk a little more without eating into the schedule. [Moderator: "Thank you so much."] Thank you.