Gautier, and he will be talking about one of the most exciting and hot areas in big data these days, which is deep learning.

Yeah, hi. So I'm Gautier Berthou, I'm a researcher at the Swedish Institute of Computer Science. I was supposed to do this presentation with Jim Dowling, but he couldn't come here today. So what I will speak about is deep learning, and how the platform we are developing can help people get started with deep learning. The thing is that my team at the Swedish Institute of Computer Science has created a startup, and what we are doing is developing a new Hadoop distribution — the first European Hadoop distribution — and some aspects of it can be very useful for getting started with deep learning.

To start with, a question: what are the factors that hold back artificial intelligence today? If we ask Bender, we might think it's alcohol and sex, but according to Andrej Karpathy — I don't know exactly how to pronounce his name; if you don't know him, he works at OpenAI, and his blog is quite influential in deep learning, so this citation is taken from his blog — there are four factors that hold back artificial intelligence.

The first one is compute. This is compute power, so hardware: how much machine power you have, your GPUs, your CPUs, and things like that. The second one is data: to do deep learning you need big data, and you need big data that is clean, and right now that is not so easy to find. The data is kind of all over the place on the internet; you don't really know where to find it, and when you do find some data, you don't know if it is really clean, and things like that. The third component is algorithms: this is what you work on if you want to develop new artificial intelligence systems. And the last one is infrastructure: what do you run your deep learning system on? So the operating system, but also where you store your data, and things like that. The problems that our platform addresses are the data problem and the infrastructure problem.

So we'll start with the data problem. The data challenge: how do you find good and interesting data? Okay, if you are in a big company — you are Spotify — you have the data there and you can use it. But if you are a hacker or a researcher and you want to experiment with data, you don't necessarily have it, and you go looking for it. Right now it's kind of like going to IKEA, but going directly to the warehouse part: the datasets are all over the place, they are all in boxes, you have to find the boxes that go together, you have to bring them home by yourself, and then you build it, and then you find out that, oh, it's not what I wanted.

So how can we help with that? And then there's the other problem: as I said, you bring it back home — but how do you do that? Right now — you can't really see the slide, but it's not so important — it's a line of code that tells you how to download some data from S3, and most of the time you have to run it a hundred times to get all the data, and if one download crashes in the middle, you have to find out manually and redo it. We had a PhD student downloading weather data: it took him one week the first time, and then he reimplemented the client to download the data, and it went a lot faster.
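To give an idea, here is a minimal sketch of the kind of "keep retrying until everything is there" client he ended up with — assuming boto3 and anonymous access to a public bucket; the bucket, prefix, and paths are hypothetical, not his actual code:

```python
import os
import time

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for a public bucket; the bucket/prefix below are made up.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def fetch_all(bucket, prefix, dest, attempts=5):
    """Download every object under prefix, resuming past finished files."""
    os.makedirs(dest, exist_ok=True)
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            target = os.path.join(dest, os.path.basename(obj["Key"]))
            # Skip files that already came down intact in a previous run.
            if os.path.exists(target) and os.path.getsize(target) == obj["Size"]:
                continue
            for attempt in range(attempts):
                try:
                    s3.download_file(bucket, obj["Key"], target)
                    break
                except Exception:
                    time.sleep(2 ** attempt)  # back off, then retry

fetch_all("some-public-weather-bucket", "noaa/2016/", "/tmp/weather")
```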
And then, if you have data and you want to make it public, how do you do that? There is no straightforward solution right now. You can put it on Amazon or on Google Cloud, but then who pays for that? Who pays each time someone downloads it? And then you have to make some web page to publish it, and things like that. There is no nice platform where, in a few clicks, you can say: okay, now my data is public.

This is one of the things we do in our platform. I wanted to make a demo, but the problem is that I can't connect my computer to the video projector, and the demo is running on two virtual machines we deployed in my lab, and I need my computer to access them because of firewall problems. So I will describe the demo and show some screenshots.

The idea is that we have two virtual machines, each with the full Hops stack: we store the data in HDFS and then use YARN and things like that. So it's a full Hadoop stack, and we have data in it, and we want to share that data. We have a nice interface to see your data in HDFS — I will show it a little more when I speak about the infrastructure. We have a dataset in one of the virtual machines, and what we do is just right-click on it and say: share this dataset. This dataset is now public, and it has a description, which is a README file in the folder that describes what is in the dataset. Then, on the other virtual machine, we can search for the name of this dataset, or for something in the description — we use Elasticsearch for the search — and we get as results all the public datasets that match, and we say: okay, download it. (Yeah, the screenshots are very bad.) When we download it, we say which project we want to add it to, and then if it's HDFS data we can directly dump it into HDFS, or if it's data in a suitable format for Kafka, we can also put it into Kafka. And once we have done that, it starts downloading and we can start processing it.

How does it work in the background? It's a peer-to-peer, torrent-like protocol, so if a lot of people have this dataset in their cluster, it will be faster to download. And we use LEDBAT as the transport protocol. The idea is that LEDBAT is not intrusive. We know that if you have datasets that are public, it may not be your priority to upload them to the people who want to download them; your priority is to serve the people who use your cluster directly. LEDBAT is good for that because it will only use the bandwidth that is not being used by TCP. So if people are not using your cluster, it will be able to upload at maximum speed, but as soon as TCP connections start to use your cluster, it will back off and use less bandwidth.

The other interest of LEDBAT is that it works a lot better than TCP over high-latency, high-bandwidth connections, which are the connections you have when you have clusters on both sides of the Atlantic or something like that. I had numbers, but... The problem with TCP is that the congestion window reacts badly to high latency.
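A rough intuition for the latency problem: TCP delivers at most about one congestion window per round trip, so a window that saturates a 1 ms link delivers roughly a hundred times less over a 100 ms transatlantic path. LEDBAT instead drives the window from the measured delay, as described next. Here is a toy sketch of that window update in the spirit of RFC 6817 — simplified (the real protocol keeps a history of base delays and clamps the gain), and my own illustration rather than the platform's actual implementation:

```python
TARGET = 0.100   # target extra queuing delay we allow ourselves, in seconds
GAIN = 1.0       # max window growth per RTT, in MSS units
MSS = 1452       # payload bytes per packet (typical for uTP-style transports)

class LedbatWindow:
    def __init__(self):
        self.cwnd = 2.0 * MSS           # congestion window, in bytes
        self.base_delay = float("inf")  # estimate of the path's fixed delay

    def on_ack(self, one_way_delay, bytes_acked):
        # The minimum delay ever seen approximates the empty-queue latency;
        # anything above it is queuing delay that our own traffic is causing.
        self.base_delay = min(self.base_delay, one_way_delay)
        queuing_delay = one_way_delay - self.base_delay
        # Below the target: grow. Above it: shrink, proportionally to how far
        # past the target we are. This is what makes LEDBAT yield: when TCP
        # fills the queue, the delay rises and LEDBAT backs off.
        off_target = (TARGET - queuing_delay) / TARGET
        self.cwnd += GAIN * off_target * bytes_acked * MSS / self.cwnd
        self.cwnd = max(self.cwnd, MSS)

    def on_loss(self):
        # Loss still halves the window, but at most once per RTT in real
        # implementations; a stray loss on a long link with stable latency
        # does not keep dragging the rate down the way TCP's reaction does.
        self.cwnd = max(self.cwnd / 2, MSS)
```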
And so when the latency increases, the throughput at which TCP can transmit over the network goes down. LEDBAT measures the latency, and looks at the change in latency, to know whether it should decrease or increase the throughput. This is also an advantage when it comes to packet loss: LEDBAT will not take a packet loss as "okay, now I need to halve the size of my window". It will say: okay, one packet was lost, but globally the latency is the same, so maybe it was just this one packet, which can happen on very long links. And lastly, we download the pieces in order, so you can start to process what you are downloading while it is still downloading.

So now for the infrastructure part. If you want to develop artificial intelligence, you want to work on the artificial intelligence algorithms and stuff like that; you don't necessarily want to deal with all the infrastructure underneath — you don't want to be this guy. For that, we propose to use TensorFlow on Hadoop.

Why TensorFlow? If you were here two or three presentations ago, you saw that there are a lot of existing systems for doing deep learning. We chose TensorFlow because it's the big one right now: it's the one everybody uses, and Google is behind it, so I guess that helps. I'm personally more on the Hadoop side of the development, so I don't know that much about TensorFlow — sorry about that.

And we think Hadoop is a good fit because, as I said at the beginning, you need big data to train your models, and the way people do it right now is not so good: they have the data in Hadoop, they have TensorFlow on another machine, they copy the data over to the machine where they run TensorFlow, they run their algorithm, and then they put the result back into Hadoop. This is not efficient. Luckily, a system has now been developed so that you can, from TensorFlow, use data that is in Hadoop. On our platform we have integrated all of this together, so that people can directly run their TensorFlow jobs in Zeppelin, using data in Hadoop, without having to do all these transfers. It's not perfect yet — there is still a lot of work to do, and this will be part of our future work.

So I will just do a quick demo of the system; the resolution will be bad. The thing is that we run this platform on a data center in Luleå, in the north of Sweden — the same place where Facebook has their European data center, I guess. (I just want to zoom out — oops — yeah, my interface is in Swedish.) Okay.

So the idea is that we have a front end for the Hadoop system, and we have a project system where everything in Hadoop is part of a project, and inside projects you have users that have different roles, and we ensure isolation between projects. So if you have data in Hadoop and it is part of a project, you know that people who are not part of that project cannot access it.

Then we can go to SpaceNet. On this data center, we gave access to the students of the deep learning course, and they did their homework on it, and one of the groups did the SpaceNet challenge, and they ran it on the cluster.
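To make "running TensorFlow directly on data in Hadoop" concrete, here is a minimal sketch of reading training data over HDFS. It assumes a TensorFlow build with HDFS support and the usual Hadoop environment (HADOOP_HDFS_HOME set, CLASSPATH from `hadoop classpath --glob`); the namenode address, path, and feature names are my assumptions, not the students' actual code:

```python
import tensorflow as tf

def parse(record):
    # Feature names and types here are hypothetical.
    features = tf.io.parse_single_example(
        record,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [256, 256]) / 255.0  # fixed size, normalized
    return image, features["label"]

# TensorFlow reads hdfs:// paths natively when built with HDFS support,
# so no copy out of the cluster is needed before training.
dataset = (
    tf.data.TFRecordDataset("hdfs://namenode:8020/Projects/spacenet/train.tfrecords")
    .map(parse)
    .shuffle(1024)
    .batch(32)
)
```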
They just went to Zeppelin, and in Zeppelin they created notebooks. So these are the notebooks — this is all TensorFlow code — and there are some pictures. The idea of SpaceNet is that it's satellite pictures of the Rio area, and the goal is to recognize whether there are buildings in the picture. You can see that where there is a building, it has outlined the part of the picture where the buildings are. So as you can see, with this platform it's easy to download new datasets and to start working on them in Zeppelin. Okay.

So, future work. I only wrote down the future work concerning TensorFlow, because it's the only part of the platform I spoke about. Something we are working on right now is distributed TensorFlow. This exists, but we don't think the current implementation is so good, and it doesn't work on YARN. Since we want to run it on top of Hadoop, we want to run it on top of YARN. So we have some people working on that, so that in the future you will be able to say: I want a cluster with these machines, and I want to run this TensorFlow program on it — and then YARN should take care of that. That also means adding some things to YARN, because we want to be able to run with InfiniBand and on GPUs, which YARN doesn't handle right now; right now YARN only knows about CPUs and memory. So we are also working on adding this information into YARN.

As I was saying, we have a cluster up in Luleå. It is already in use — there is data in production that the students and researchers are using. If you want to test our code or our platform, you can either go to Hops.io, where there is a Vagrant setup, so you can deploy the full stack with Vagrant and start to play with it; or we can give you access to the cluster in Luleå, but you won't have access to the GPU cards, because we don't have enough, so we keep them for ourselves.

Then I just want to thank all these people. We have a lot of alumni, because we are a research group, so a lot of students have been working with us and have done valuable work; but as you can see, we are quite a small team working on this project. We created a startup to develop it — it's called Logical Clocks. And we are looking for people who want to either use our system and give us feedback, or develop it. It's all open, so you can go on GitHub or Hops.io to get more information about us. Thanks.

Questions? Any questions?

Thank you so much again.