Hello, is that you? Hello? Okay, we'll get started while Kevin sorts himself out. I'm Jorge Castro, and this is Kevin Monroe. I've been coming to SCALE for four years; this is his first year, so I'm really excited to do this talk with him. How many of you have been to one of our big data talks before, at last SCALE? None of you? Awesome, so you're all fresh.

First of all, thank you for coming to a talk on a Sunday morning. I realize that with all the partying going on, it can be difficult to wake up, so thanks for coming; we really appreciate it. The way we like to work is this: I don't like to go to talks and sit while a guy preaches at me with slides. That's not how we roll. We like to do stuff for real. What you're going to see here is going to be real, on bare metal, on this portable cloud we brought for you, and Kevin's got stuff fired up on all sorts of public clouds. And we want it to be driven generally by questions. We want to know how you're running big data, if you are; what your blockers are; how we can help make your life suck less, that sort of thing.

So the way we're going to work is: we'll run through some intro slides, Kevin will give you some key concepts of the stuff we're working on, and then maybe you can tell us what you're working on and what you'd like to see. Then we can either start firing up stuff live on clouds or have a discussion on how we can help you get your project to completion. Does that sound good? No one's getting up to leave, so that must be okay. So with that, Kev, let's get started.

Thank you. Yeah, so in the spirit of Jorge's comment that we want this to be workshoppy and very engaging with you, I've prepared 26 slides to kick off our engagement process. So: this is real-world big data.
What I mean by real-world is that this is not theoretical, not academic. All the stuff you're going to see in these slides and demos is out there, publicly available, totally free. You can deploy it, as mentioned a few times, to metal, or to clouds public or private, and there are no barriers to entry here. What we really want you to leave this talk knowing is this: if you've ever thought, "man, I would like to do a Hadoop service or a Hadoop solution, I just want to use Hadoop," but you thought the barrier was getting it deployed and configured correctly, we hope after this talk you'll see that that's not a big issue.

So Jorge and I, for full disclosure, work at Canonical. We use juju and we think it's fantastic, but we fully recognize that not everybody uses juju. There's plenty of config management and other tooling you could use to set up big data; in this talk, though, we're going to show you how we do things, with big data and juju. The core tenets I have here should transcend the model you use, the tool you use, whatever. I just want to make sure we're all on the same page with these basic principles of big data deployment.

The first bullet there: juju, if you don't know, is our modeling language. There have been a few talks about it, and you may have heard Mark Shuttleworth mention juju a few times. It's our way to take a complex environment like big data or OpenStack, or any of a number of really complex things, and make a model that is deployable. That's the easy part, just putting the software on a system. Then we make sure that it can relate to things: if I have a service like a name node, I know I'm going to need a data node, right? They kind of go hand in hand, so set it up so that one can relate to the other. And if I only have one data node, it's not going to last me very long, so I need to scale up to 10 or a thousand or however many data nodes to make a large distributed file system. So scalability is also something we consider important in the model.

Once you have some mechanism for deploying, relating, scaling, and configuration, then these next three bullets become important, and especially important to us. I hope this is common sense; again, I just want to level-set us all. If you don't think of these three things, you're a terrible person, because the people that inherit your stuff later will be angry with you, right?

The first one is reliability. When I deploy something today, maybe it's a proof of concept, a very small cluster, and suddenly it takes off, my project has worked, and I want to go up to whatever scale or whatever production environment I might want. It would be awesome, and it should be the case, that if I deploy that again, it works the same as it did before. I shouldn't need to tear down my environment and restart from scratch. So the model you use and the services you deploy should be very reliable, in that they do the same thing over and over again.

And that bleeds into repeatability. If you're on a laptop, say in LXC containers, and you say, "okay, I'm ready for the cloud," you should be able to take that model that has worked and do it in the cloud. Conversely, if you're in the cloud and, like an Ashley Madison, you're like, "crap, let me get this out of the public cloud," you may want to come down to a private OpenStack or something that is maybe more in my control,
where I get the hardware and everything. So you want that model to be repeatably deployable, repeatably set up and working for you. That's a core tenet for us.

And then observability. This is sort of a blanket adjective, but what I mean by it is that when the thing is deploying, I want to get some status, some feedback. Can I observe what the thing is waiting on? Am I blocked? Am I trying to spin up something and installing the operating system, waiting for relations to things like name nodes and data nodes? It also extends into: did my model perform like I expected it to? Is this the right cloud for me? For that, maybe you would use some form of benchmarking; you would maybe do a TeraSort or something to say, "this is what I thought would happen," and it's nice to be able to verify that what you're seeing in the real world is what you expected.

So these are our core tenets, and again, they transcend juju; they're just good ideas for any big data deployment.

Then we took that a step farther in juju, with our big data charms and models. We said, you know, there's a big barrier to entry, and we've heard from people that are third-party service vendors: they want to create a big data service, the next Hive or the next Spark or whatever, and they say, "geez, I don't want to be a Hadoop distribution. I don't want to support HDFS and MapReduce and all that stuff; I just want to use it. I know I need HDFS, but I don't want to be suckered into deploying or funding it, and what do I do, give you a big blob just to test my service?" That doesn't work well. So we thought, hey, let's make this model in such a way that you just plug into it. You get to forget about the stuff on your left, that diamond on the left of the diagram. That's what I call core Hadoop: MapReduce and HDFS and stuff like that. And you get to just play over here at your client level.

So if you're a server application developer, that's cool; you can just plug into Hadoop. It also works for people on the user side of big data. Say I'm a data scientist and I've got this neat model that I want to run on my data. I don't know, and don't want to know (it's not my field of expertise), how to set up Hadoop and configure it. Did I put the SSH keys in the right place? Did I distribute /etc/hosts around my cluster so that these things can talk to one another? We've tried to model this in a way that, at this box here, you can forget the stuff on the left. That's the easy part: easy in that we've done it once, so we can do it again quickly; not easy in that if you tried to do this from scratch, you would be hurting.

So that's why we have this little client thing over here. All it represents is an endpoint you can talk to, that you can get into Hadoop with, and you know that as long as your plugin is with you, you are compatible with the cluster on the left. It's got the right Java version, the right Java vendor, the right Hadoop version, the right Java libraries, plus the utilities, like dfs ls and stuff like that, that you may actually want to get in there and run. But this client is just an endpoint, right? That's not all that great on its own; it's just a box or something to SSH into.

So I want to go back to talking about, if you were an application developer, what kind of client applications and services we're talking about, and I'm going to burn through these really fast. Stuff like Flume for ingestion. Again, that's not all that crazy cool, but it does let you get stuff into HDFS. Maybe you didn't know how to do that before; you can now use a Flume service. Maybe you want to expand it to include Kafka, right?
So you've got a pub/sub model where I'm subscribed to a topic and I get messages in, and maybe I want to send those through Flume and into HDFS, and of course ZooKeeper is there to keep Kafka in check. So that makes a sort of ingestion story.

Then we've got some analytics stories we can start building solutions with. Things like Pig. Again, Pig is very simple; there's not even a service that goes along with Pig, it's just a binary. But then maybe Hive is something important to you, so now we can talk about database relations: can we swap this in and out with MariaDB or DB2 or Oracle or whatever? You can start to get a feel for why a pluggable model is special to us.

So you may say, "all right, great, I've got these little ancillary utilities that can help me, but I read on the internet that Spark is cool." Okay, sure, plug Spark in; we can do that. And you may say, "look, Spark is cool, but I don't want to SSH into Spark and run spark-submit. That's not my forte. I want a notebook; I want to type my code into the internet." So we say, all right, here's a notebook that plugs right into Spark, which is already plugged into the main cluster there on the left. You say, "no, Kevin, I don't want an IPython notebook, it's terrible, I want something else." And I say, fine, use Zeppelin. So we have these services already available for you, and you can just start to mix and match. You say, "no, Kevin, I don't want Zeppelin." Fine, build it yourself.

We have tried to survey the landscape of big data and pick out the core services we think are important to you, and we've got those in a sort of Lego-building-block, Lincoln-Log fashion where you can put them together. As an application developer, maybe there's some analytics with Pig and Flume, for example, that you want to put together out of those two blocks; you can use them to either accelerate your application development or just get to work using those services.

So what Jorge is doing is switching over; I'm at the "build your solution" slide, since I queued them up on that. What we want to do now will take about 20 minutes, which, if you think about it, is pretty incredible given what we're about to do: deploy what we consider a real big data solution. It's log analysis. Any time I hear about people getting started in big data, they do two things: they sort some logs, or they analyze some logs, and then they ingest some Twitter stuff, and they're like, "cool, I got some practicality with my log analysis and now I can read tweets, and that's awesome." That's how people usually start with big data. So we put a bundle together; in our parlance, a bundle is just a collection of charms, right?
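As a rough sketch of what that looks like on disk: a bundle is a YAML file naming the charms, how many units of each to run, and the relations between them. The charm URLs and service names below are illustrative, not the exact contents of the bundle deployed in this demo:

```yaml
# Hypothetical syslog-analytics-style bundle: services, unit counts, relations.
services:
  namenode:
    charm: cs:trusty/hdfs-master    # HDFS name node
    num_units: 1
  slave:
    charm: cs:trusty/compute-slave  # data node + node manager
    num_units: 3                    # scale out by raising this number
  flume:
    charm: cs:trusty/flume-hdfs     # forwards ingested logs into HDFS
    num_units: 1
relations:
  - [slave, namenode]
  - [flume, namenode]
```

Deploying it is then a single command against the bundle file (the exact CLI differs between the juju 1.x series used here and juju 2.x).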
These little repeatable units of service that you can deploy in a solution. One of the bundles we have, the one we're going to kick off right now, is called the syslog analytics bundle, and it exercises a lot of the components I went through quickly in the slides. These are things like syslog forwarding to some listener; in this case we decided that listener would be a Flume agent. So our logs make it to Flume, then get forwarded on to any number of scaled-out Flume agents, and finally make their way into HDFS. Once they're in HDFS, we thought, well, now we've got to do something with them. We're not going to hdfs cat the logs, right? That's terrible. So we've got Spark, again because the internet says Spark, and we now have a few tutorials and a few jobs that let you smoke-test these services together, from ingestion into HDFS, to processing with Spark, and on the other side we're going to take a look at it from Zeppelin, since you guys hated the Python notebook.

What we're looking at here, for those of you who haven't seen an Orange Box... Jorge, do you want to do the intro on the Orange Box?

So this is a 10-node portable cloud that we take everywhere, because people don't believe you. You see a presentation on a laptop and it's like, "hey, here's your cloud"; I don't believe you. These are real; these are 10 nodes. It's consumer hardware, though, so for these purposes just pretend this is your cluster at work, or your data center, or things like that. Each of these lights represents a node, so when he starts firing off stuff, you'll see a power light as the node comes on, and then the I/O blinking and doing things. You can also do this on your laptop, but this helps illustrate what's possible.

The reason we staged things beforehand is, again, the networking and the amount of data that we're going to pull down to demonstrate this. I want to make sure we have time before the session ends, so this will actually get stood up.

What you're seeing here is our Metal as a Service, MAAS. Again, this is just a representation of any hardware you may have lying around in your lab. MAAS is Canonical's offering to talk to those boxes, talk to bare metal, and it can put the operating system on there for you. It works really well with juju, so the charms and services I went through in the slides I can deploy directly on top of those nodes, and it works pretty seamlessly. What you're seeing here is just a list of the nodes that are in that box. What really gets exciting for you people on the right is that you'll see the lights turn on, and then you'll know it's working. People on the other side, you'll just have to trust it if you can't see the lights come on.

So I want to flip to a terminal. We did a little bit beforehand: we bootstrapped a juju environment. What that means is that juju has a state server, something that knows about all the other nodes in a deployment, a way to facilitate that communication. So we went ahead and bootstrapped that, and it's got a GUI of its own that you can look at to see what services are deployed. The real magic, though, comes in these, I don't know, 30 or 40 characters: this concept of juju deploying a bundle. And again, a bundle is just a collection of charms, a collection of these services that we have identified as providing our big data solution; in this case, analyzing syslog information.

So somebody hit enter for me on this. What we have at the bottom is the only command you have to type, once you've decided you want to use juju, done our getting-started page, and installed the client. That's it, and it's pretty sweet. What you'll see up here, this is in a watch: you'll see machines start to get allocated, you'll see the light bulbs turn on up front, which signifies that MAAS has powered them on. It's going to install the operating system, and then juju will start deploying charms, a bundle of charms in particular.

Yes, sir? "Okay, so I know you're going to go through all the details, but what about the configuration for the hardware? How does juju know about the 10 nodes?"

Yeah, so the collection of hardware that you have is controlled by our Metal as a Service offering, MAAS. What you do is register your hardware. For example, with our KVM hosts, it knows how to power a KVM box on and off, and that works with anything that has IPMI or BMCs or whatever sort of power management controller you might have in your hardware. In the case of these things, they're NUCs, right? So, I forget what Intel calls it... AMT, thank you. So MAAS has the knowledge of how to turn those things on once you register them, saying: here, I have 10 machines, and I want you to use them in this environment. Juju just comes along and says, "hey, whatever substrate this is, I need a machine, I need something to install on."

"Is it also free to set up, free to kick the tires with?" Yeah, okay, certainly.

And MAAS is just one example, for managing a lot of bare metal, usually at data-center scale, using juju with a bunch of machines.
You could just say, "hey, juju, here's a bunch of machines over SSH; start using these now." So you're not limited to using MAAS just for bare metal, but it makes managing that metal easier; it kind of puts an API in front of your servers, like you'd have with Amazon or Google or Azure or anything else.

Yeah, that's a great point. MAAS is nice when you want to be able to power that stuff remotely, but like you said, you could just have a bundle of SSH droplets on DigitalOcean and tell juju that those are available to you. And MAAS also has a full RESTful API, so you can drive it manually without having to worry about individual machines, if you wanted to get in there and do something.

The question was: is MAAS free software? 100% free software. Yes, sir. Oh, I should have mentioned in the beginning: everything you see here is 100% free software. Not the box hardware, though. "How much is a box?" We don't sell these; the company that makes them, TranquilPC, sells them. This one's about 12 to 15 grand, in pounds, depending on what the exchange rate is that day.

"If I want to test this demo on my own hardware, and obviously I don't have 10 machines like there are in the box, do I need MAAS in order to manage the VMs on my box, or does juju manage that directly?"

That's a good question. So juju can speak to certain pieces of VM infrastructure directly, but mostly just LXD. If you have LXD, which is the quote-unquote hypervisor for LXC containers, which are just system containers like you'd expect, those are a little lighter weight. There is no direct libvirt or any other kind of connection layer; you'd have to use MAAS for that, or spin the VMs up manually and then add them to juju manually over SSH.

"So I'd have to install MAAS on those VMs and then enroll them?"

So, the question is about where MAAS fits into this. You can actually just install MAAS on one machine, and then you tell MAAS where your machines are: either via MAC address, or via power credentials, or via libvirt. It's very easy to say, "this is my libvirt machine, these are its parameters, this is the machine name." And you don't have to do anything custom to those machines. MAAS will turn them on, commission them, get the hardware information about them, turn them back off, and then list in the GUI that these are available for you to install stuff on. So it's a lot less of a barrier there.

Oh, yeah, I'll just relay the question. The question is: where do you put MAAS in this whole scenario? Without getting too much into the details, MAAS likes to manage DHCP and DNS for the rack below it. So it works best when you have maybe a box with two NICs on it: one to the outside network, and one to an internal switch where all the machines you're going to power on live. You can install it on the machine where you're running your VMs; that's probably easiest for a dev box with a bunch of VMs on there, where it can manage the bridge. You could also put MAAS in a VM, managing its VM network. There are a bunch of different configurations for it; they get more complicated as you start drilling down further, though.
Yeah, but there are different scenarios to support what you're looking to do, depending on what you want.

Sure, so the question is whether you could use Ansible or some other tool to create the VMs. From a juju perspective, juju doesn't really need anything outside of the base operating system; it expects to have a clean image of either Ubuntu or CentOS or Windows to manage on top of. So I'm not sure how using Ansible would fit; I'd like to talk a little further afterwards about how that scenario would look. But, and maybe I'm misunderstanding the example, there would be nothing for Ansible to provision, because juju is just using a base operating system. There's nothing on top of that operating system image that Ansible would need to put there for juju to work. If the machine's already on and running, and sshd is installed, which it is by default on most, you just give juju SSH credentials and it can connect to it.

Yes? "Can you connect to most every public cloud you can imagine?" Yeah, sorry, I should have made that more clear as well. This demo is on bare metal, but juju also speaks natively to Amazon, GCE, Azure, DreamHost, Rackspace, Joyent... if you can name it as a public cloud, juju can probably talk to it.

But I see what you're asking now, and that would make sense: if there wasn't a cloud, and you had an Ansible thing that provisioned machines and gave you back IP addresses, you could do that with juju; you could glue those bits together. I do want to make the point that afterwards, in the charms that actually execute to put the services and bits on disk, it's up to the charm author to do what he or she wants, and that's where we recommend you use your Ansible, your Chef, your Puppet, and then you get the bits on disk the way you want them. Does that make more sense? Okay.

"My question is: when you write the charms, do you have to write them differently depending on whether you're on bare metal or a different infrastructure-as-a-service, especially for address management? I know when I use Amazon you have to be very sensitive about whether they're public or private IP addresses; it's very finicky about it. So does this give you a uniform way of dealing with addresses?"

Absolutely. The big data charms that we've written are totally cloud agnostic, metal agnostic. All we care about is that we are on the machine, and then there is logic in the charm that asks: what is my IP address? Is it a routable address? Does it belong in /etc/hosts? And we will configure that as needed on any substrate we deploy to. So the same charm, the same thing he's deploying on that bare metal box, I have deployed on an Amazon box. I didn't change the charms whatsoever; I typed the same command, "juju deploy this bundle of charms," and it works the same. That is because the charm author, the person with the domain knowledge of what a name node, for example, may want in /etc/hosts, has written the logic: try ifconfig; do I have it? Grok the IP address; is that a valid IP address? I don't know; well, then use hostname. Is that valid? I don't know; then let me read what the kernel thinks my network address is. The charm author has done that logic, if you will, to determine what IP address the unit has, so that you don't have to worry about it when you deploy these charms to different clouds.

Yeah? Sure, so juju is modeling how to provision these machines for you, but it also models how to do storage and networking alongside that. If we're talking about using juju against Amazon, juju knows how to create and manage subnets inside VPCs and attach those subnets to the services you deploy. So the links, like "this service belongs in this subnet and should be able to bridge to that subnet," can all be modeled with juju. That complexity is there, and MAAS also helps you model things like: how do I do bonded NICs, or how do I manage which NIC goes onto which VLAN, things of that nature. So modeling that level of complexity, past the flat network, is something juju allows you to do, using the native APIs of whatever service you're running against.

"And what about automation, not just whether the NIC is bonded, but what network you connect to, virtual networks?"

Right, so the question is about how far you go with automation, being able to tell where the networking goes, and juju does that, again by talking to the native provider. We have a VMware provider that knows how to speak vSphere and all the other components there, and you can use juju to declare: this juju construct correlates to this VLAN inside of vSphere, or this VPC inside of Amazon, and when I deploy services, I deploy them into these subnets, or into these spaces, which may be one or more subnets or networks or VLANs or whatever that may be. That's possible there with juju. Great question, though. Fantastic.
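To make the spaces idea concrete, here is a hedged sketch of what endpoint-to-space bindings look like in a bundle. This uses the bundle syntax from later juju 2.x releases, not the alpha being demoed, and the space and application names are made up:

```yaml
# Hypothetical bundle fragment: binding application endpoints to network spaces.
applications:
  namenode:
    charm: cs:trusty/hdfs-master
    num_units: 1
    bindings:
      "": internal-space       # default space for all endpoints
  webproxy:
    charm: cs:trusty/haproxy
    num_units: 1
    bindings:
      website: public-space    # put only this endpoint on the public space
```

The same declaration maps onto a VLAN in MAAS or a VPC subnet in Amazon, depending on the provider underneath.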
Yeah, these are great questions, but I do want to turn your attention to what has been happening since we ran this deployment. I didn't start a timer, but I think it's been about 10 minutes. Remember, we just ran the one command down here, and that bundle (remember, we're getting to the syslog analytics: we're ingesting stuff with Flume, attaching Spark, and going to run some visualization on it) was comprised of these services. Compute slaves, which for us just means both a data node and a node manager; that's what we consider a slave. A couple of masters: the YARN master and the name node. Some syslog stuff to forward the logs on, and some Spark to do the processing. So those are the services at the top of the status output, and again, this ties into our observability: I can see what's happening.

Down here is a representation of the actual machines. So our slave, for example, that thing that has the data node and the node manager, there are actually three units powered behind it. And I forgot to tell you to watch for the lights, but all the lights are on now; they were off, and now they're on, so you know it worked, right?

What we're seeing here is the state each of these individual units providing these services is in. You can see our slaves are blocked. Why? Because the name node isn't ready yet. Why isn't the name node ready yet? Because we're still installing the Hadoop base. Once the name node becomes ready, we'll send a message saying, "who wants to connect to me?" Then the data nodes will come along and say, "I am for you," and they will establish that relationship, configure themselves accordingly, exchange SSH keys, exchange /etc/hosts information, set configs, et cetera. And at this point the HDFS master is now setting up, so we should be just a couple of minutes away from seeing what that looks like.

You can also just look at the pretty version of that if you want. Oh yeah, fantastic, I forgot: when we deploy a bundle, it comes with a GUI where you can see pictorially what services are being installed, and it has some amount of color coordination, so you can see that Flume is doing something right now; that's why it's yellow. Things will turn green. This is the graphical representation, and before you say anything: I know there are a lot of elephants on the screen. We're working on how to differentiate what those elephants are. One of them is a name node, one of them is a YARN master, and so on; we're going to put little hats on them or something so you can tell which ones are which.

Yes, sir, you up front, random stranger. "When you move these little dots around, is it of any significance at all?"

The question was: when I move stuff around, is it of any significance? No, this is the human version. I like to order mine top to bottom based on what things are doing, but everything he did in the command line, you can drill down into here and configure different metrics and config files. And when you do this, you can do a commit, so you'd see who changed the cluster, that sort of thing. The state is in real time, though. One thing I like to do is have this up on a big screen where everyone can see it while everyone's working on it, so you can see what's happening, which is kind of cool from a "hey, what's it supposed to look like in the grand vision" kind of thing. So I call this the manager view.
I don't know if that jibes with you there. This is my view, though.

All right. "So I understand that right now what's publicly available is version one of juju, and I hear a version two is coming out really soon. If I spin up things on version one, how easy is it going to be to transition over to version two?"

That's a great question. Yeah, I may have given away the game a bit there. We are coming out with a 2.0. It's a mostly compatible version, but we are breaking some backwards compatibility with the CLI, with the UI, and with some of the deployment methods. If you're deploying on the latest version of juju 1, that's 1.25.2, you should be okay, as long as you're not doing anything with LXC containers, because that's changed between the 1.x and 2.x series. If you're starting to plan something with juju today and you're going to deploy in the next month or two, I would recommend using the alphas. They'll be a little sharp and pointy as far as experience goes, but they'll be more in line with what will be coming out and what we'll be recommending everyone move to. But yeah, that's a good question. This demo is running everything on alpha, because we like the bleeding edge for demos.

Oh, and that's a great point: your initial question about networking spaces, that's something that's just landed in the alpha that's coming out in 2.0. Another thing is the idea of how to manage models. Right now it's very expensive: for every topology that we model here, you have to have a bootstrap node, for every single one. That's a machine you have to pay for, that you have to take out of your cluster. In the next version we have this idea of models being shared on a single bootstrap node. You create one node that manages orchestration, but you can do a namespace-like segregation of different deployment topologies within it, so you don't have to do a very expensive spin-up.

You can have a GUI that you switch between. Actually, go back to the GUI, Kevin. If you look at the top of the GUI here, there's a drop-down box where it says "maas"; you'd be able to create a new blank canvas there. Just type a new environment name, hit new, and: instant new canvas. So it makes it really easy to start spinning up different test and dev environments without having to go through the five-to-ten-minute bootstrap process. And you could say: I'm creating new user accounts for you all; you get access to this canvas, this model; you get access to that model; you get access to both these models; you only have read access to this model. So we're adding this idea of user ACLs, with the ability to create these quick, disposable models. Your model, your playground, is completely isolated from everyone else's, but as an admin I can see all of your models, while you're confined to your one little deployment, essentially. But that's a great question. There are a few other things as well: we're redesigning some of the command lines, we're making things easier. Storage is there now, but we're enhancing our storage support and making more clouds available through the tool. Those are the main components, plus just streamlining the whole user experience and correcting the things we got wrong in 1.0. So we're breaking a lot of the command lines, making the API more robust, things like that. A great question.

Yeah, you also have a question? No, that's a great question: are we running MAAS and juju on the same machine?
Actually, we are. In this case node zero, which runs the management node for MAAS, is also running Juju. But that doesn't have to be the case: as long as your laptop, or whatever client you're using with Juju, can connect to the machine, it can talk to it. Kevin has a Juju client running on his laptop, which he has connected to all the clouds he's bootstrapped. The Juju client just has to live wherever you are; since this box moves around so much, we run the client on the box itself. I like that question, though. Yeah.

So the question was: does Juju have a concept of inventory like Ansible does, and how do we deal with secrets? (I'm just a guy in the audience.) You'll have to quickly explain what Ansible inventories are; I don't quite remember what they are.

Sure. Yeah, so we have a bit of that here. We can't see it now, it's scrolled off-screen, but the GUI has that idea. Juju doesn't really keep track of things across multiple deployments; a deployment is a silo that manages itself. It knows all the resources that are in there, where they are, and what they're allocated to. With this new multi-model idea, you could have multiple deployments all on a single node, where you could theoretically see across them and see how the machines are allocated. When we deploy against cloud providers, we also tag and name things appropriately: in MAAS we tag which services are on a machine, and in AWS we apply tags for the services that have been allocated to each machine. So from a management view, if you're using those dashboards, you'll see them as well. That's probably as close as you'll get to something like an Ansible inventory. On the other side, how we deal with secrets in Juju: well, that's still a really hard problem to solve.
We haven't quite settled on how we handle secrets. We've been toying with the idea that when you have to supply configuration that is a secret, you can declare in the configuration file that it's a secret, so read-only users would see stars or something of that fashion, but anyone with admin access would still be able to retrieve it. We're also building an idea of resources into Juju, where you can declare a binary blob or some kind of deliverable payload that goes to the service, and that could potentially be an encrypted blob of data as well. So we're still figuring that out. We're watching how the other tools are doing this; I know HashiCorp has just come out with Vault, and there are some other things, so we're going to see how that plays out and what best practices emerge. It's still a very tough thing to solve in general. But that's a great question.

There is absolutely nothing stopping you from saying: I deploy my own Vault, I go in and manage my secrets there, and then I have a relationship that can distribute those secrets out. Nothing stops that today. Of course, we'd like to eventually evolve that into a primitive Juju understands. Right, exactly; but today, nothing stops it, and we could definitely do that. Yeah, I don't want to hog too much of Kevin's big data talk; this is a big data talk, not a Juju and MAAS talk. We can come back to that towards the end.

Sorry, I don't know what that is; I still don't know what that is. Okay, okay. Yeah, so that's very valuable feedback, because we're interested in that. There are so many services around big data, right? For lots of different facets: ingestion, processing, visualization, and so on. So it's very important for us to understand where the community is going, what people are working on, and what services people want in a big data solution.
So it's awesome to hear that kind of stuff. We'd love to engage folks more, and we'll have some links about how you can get in contact with us. If you notice a service is missing, that can certainly go on our roadmap. The most recent example: we were ingesting things into Kafka and sending our Kafka messages into Flume, and somebody said, why are you doing that? What's the point? Use Gobblin, the follow-on to LinkedIn's earlier ingestion tooling. And we said, thanks for the tip, and so we charmed up Gobblin for that reason, to put Kafka messages directly into HDFS.

What I wanted to note again is that all the services have now become ready; I think we're about 20 minutes into that deployment. What we've done here is pretty neat: we've spun up ten or twelve machines, connected them all together, and made a big data proof of concept. To show you I'm not fibbing, I'll fire up Zeppelin and log in. Actually, first I'm going to expose a couple of services. Oh, you're right, I'm sorry. If you were in a cloud, by default you'd see... here are some IP addresses; I'll show you on an Amazon deployment. These are obviously local to this box, but you'll see IP addresses and ports. We don't just open those up for you: services are closed by default, and when you know you want a service exposed, you expose it yourself. I'm going to grab that port, 9090. (Please silence your phones, or at least turn the mic off.)

All right, so we just deployed Zeppelin. This is our interface into Spark. We're going to run a quick job just to verify that ingestion has happened and that stuff landed in HDFS; we're going to sort some logs real quick, however many syslog messages have been generated.
It won't be that many, again, because it's only been up for 20 minutes, but we'll get to see them. As the charm author (I wrote the Zeppelin charm), I added some default tutorials so that folks can smoke-test things. It's really important to know: okay, it deployed, but does it actually work? So we're going to run the Flume tutorial here. Zeppelin, if you've never played with it, I highly encourage you to; it's one of my favorite visualization projects so far. Cos, you might know this offhand: is it still in incubator status? Okay, okay. It's a fantastic Apache project, and it's bundled by Bigtop, by the way; we're very grateful for that.

So we're going to run through these, and I'll just click the start button and talk through what it's doing. The first paragraph is Markdown; we don't care about that. It's just showing that Zeppelin has an interpreter for Markdown, in case you want to put in some nice headings. What it's doing right now, shoot, I know this is going to be small, unfortunately. Yeah, someone may want to mess with the light switch. Again, this is just a sample tutorial that we're using to demonstrate that Flume has worked. This paragraph is a shell script, run from Zeppelin on the Spark unit, and all it does is SSH to the machine in a loop; it just guarantees that we have syslog messages. These connections will time out and not show any actual data, but they will register hits in /var/log/secure, so we know we have SSH messages there. And it has finished. Next is an HDFS ls, just to make sure that we've actually got some Flume data; this is today's date, so you know I'm not fiddling. Scroll down a little bit. Next we're showing some Scala in the Spark context, which is another ability of Zeppelin.
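The tutorial's message-generating step can be sketched like this; a runnable stand-in that appends fake sshd lines to a local file instead of actually SSH-ing, since the only point of that loop is to guarantee some syslog traffic exists (the file path and message text here are made up):

```shell
LOG=/tmp/fake-syslog.log
: > "$LOG"                       # start with an empty log

# The real tutorial SSHes to the monitored machine in a loop; here we
# just write the kind of lines that loop leaves in /var/log/secure.
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "Feb 21 10:00:0$((i % 10)) host sshd[$i]: Connection closed" >> "$LOG"
done

grep -c 'sshd' "$LOG"            # verify messages landed, like the HDFS ls check
```

The `grep -c` at the end plays the same role as the demo's HDFS listing: confirming that the ingested data actually exists before analyzing it.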
Jeez, that went fast. I know, right? So we're going to create a table. We've got all this raw log data, and we've ingested it into HDFS. We know this is the directory, because when I wrote the tutorial, that's where I put those data bits. So that's coming out of HDFS. We're going to make a temporary table out of it for Spark to process, and then we get to the neat stuff: the data visualization part. Again, this might be a little lackluster because it's only been up for a few minutes, but there's some su activity that's happened. Hover over that; I wish I had a remote mouse. I got it, yeah. And there are our SSH hits, right? We did a loop of ten SSHes, and this is just a nice way to show what's happened in the syslog on the machine we were monitoring; that's coming out of HDFS. We've got some timestamp visualization here; these dates aren't valid currently, but they give you the idea. This just smoke-tests the fact that all these big data components have talked to each other: we generated some messages, we got them into HDFS, we processed them with Spark, and we're looking at them with Zeppelin.

On Zeppelin, I just want to touch on one more thing real quick. What I really like about Zeppelin is that there are tutorials the Zeppelin folks themselves have written that I'd like to show off as well. Okay, oh wait, Ctrl+V; I typed a space. Where's page up? I know, right? How many people does it take to type in a thing? Would you click save for me, and then we'll run this? This is a neat tutorial because it shows wget-ing some data. Maybe you don't have log data that you want to analyze, but you do have a giant CSV file somewhere; a lot of people have large data sets from data.gov that are freely available to wget. Hit that play button.
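The aggregation that first notebook is doing (group log events by type and count them) can be mimicked in plain shell; this is a hedged stand-in for the Spark SQL temporary table, using awk on made-up sample lines instead of Spark on HDFS:

```shell
# Sample lines shaped like /var/log/secure entries (fabricated data)
cat > /tmp/secure.sample <<'EOF'
sshd: Accepted publickey for demo
sshd: Accepted publickey for demo
su: session opened for user root
EOF

# Group by the daemon name (the field before the colon) and count;
# the same shape as the demo's SSH/su chart
awk -F: '{n[$1]++} END {for (k in n) print k, n[k]}' /tmp/secure.sample | sort
```

The output is one row per event type with its count, which is exactly the table behind the bar chart in the demo.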
Yes, sir. So again, this tutorial was made by the Zeppelin folks, and it just shows how easy it is to interact with. Yeah, this is what's running up top; I'm sorry. We're wget-ing some bank data from the University of California, Irvine, I believe; it's the UCI data set. So we've imported some of that. Down, down, down. Again, we're going to make a temporary table, and here we're going to do some analysis of banking data; I think this is mortgages broken down by the age of the borrowers. I'll show you a whiz-bang neat thing about Zeppelin. We've already done the analysis: mortgages held by 30-year-olds. Let's make them, I don't know, 25-year-olds. We can re-run this live, and it'll redistribute the graph based on the data. So it's just a really neat interface for running Spark jobs; it supports Spark SQL, shell, Scala, PySpark, things like that.

So that's our live demo on the orange box. And again, there's nothing fancy in the box; it's just a representation of metal, so don't put too much value on the fact that it's orange. It's just a box of hardware that you may very well have lying around. I do want to flip back to my stuff on AWS, just so you know that if you don't have MAAS but you do have AWS credentials, this works there too; and this will segue nicely into the next bit. Meanwhile, any questions on the services we just deployed? Yes, sir.

DC/OS? You know, I just went to a talk that compared Kubernetes and Mesos and a couple of other platforms for supporting big data.

I have to admit that I'm learning along with you at SCALE, so I don't have enough background on DC/OS to compare and contrast it with Juju. I don't know; I mean, you seem to have lots of answers. No, I can answer your question.
So when you look at things like DC/OS and Kubernetes, you're modeling all your infrastructure as golden images that you're pressing and rolling out; it's an immutable infrastructure. The difference with Juju, if you do a compare and contrast, is that you get a lot of the same features, but Juju is mutable by nature, designed to do long-running lifecycle management of services over time. So it's a different approach to software delivery and maintenance: with DC/OS you're rolling out application containers that you recycle often, while Juju models a lot of the same infrastructure DC/OS does, but with a different methodology applied on top. That's why, see here...

A comment, if I may. I'm a little familiar with Mesos, not so much with DC/OS, but essentially that's a data center resource management solution, rather than a deployment solution that makes sure all the bits fit together properly. So I think Juju is solving an orthogonal problem: it guarantees the correct assembly of all the bits and pieces of your stack, then you push it out to a variety of providers and get yourself a working solution. DC/OS is more focused on: if my little microservice died, how do I guarantee that it pops up somewhere else on a different node? And that leads to a number of problems if you have long-running services, like you said. In the case of HDFS, DC/OS is actually in deep trouble, because data retention is not something you can solve easily.

Yeah, it doesn't address data retention. With HDFS, when a node goes down, you need to rebalance your cluster, and that's a lengthy process; it takes time. Yeah. Honestly, I'm not here to knock the benefits of Marathon, right?
So what I'm saying is that these are tools built with different purposes. That is all.

As far as where Juju's sweet spot is: my understanding is that it's very good for initial deploys, but if you contrast it with, for example, Ambari, which does whole lifecycle and monitoring (I know you could spin up Ganglia or Prometheus or anything like that), what is Juju doing? Is it well suited for doing application deploys and integrating with sbt or Maven and things like that, or is it really a tool for infrastructure, not code deploys?

That's another great question. One thing about Juju that, once you see it, kind of transcends a lot of this, is that Juju isn't inherently just a big data tool. Look at Ambari: they put all their expertise behind how to deploy a big data solution. We also use Juju to deploy at-scale, HA, production-grade OpenStack; some of the largest telcos out there are running OpenStack deployed and managed by Juju and MAAS on their bare metal, and it's the same primitives here. In the same vein that we're up against Ambari in big data, there's Red Hat with their own Red Hat OpenStack deployer, and Mirantis, a company that, like Ambari in its space, just does OpenStack deploys. What we find a lot of customers and people really liking is that often you're not just deploying a big data solution; there are other things alongside it, and Juju can be the common language in which you model those services and how they interconnect, or don't. So Juju doesn't just stop at deployment.
It is also lifecycle management for services over time, but it's done in a generic language that's not specific to any one solution; it's the common language of all the solutions we've seen deployed with it.

Okay, so I guess there were two points. One is that you're closer to Ansible, Puppet, Chef, Salt, etc., in that you're general configuration management and could do end-to-end data center work, whatever it is you need to integrate. And one more detailed question: do you find there are fundamental differences between doing initial deploys and doing things that are more workflow-based? Because one of the things we've found is that for a lot of deploy work you want much more control over the workflow, as opposed to having a tool automatically generate the workflow for you. It's more inherently sequential, although you could do it in a model-based approach; these tools seem to struggle a little there. And then you have PaaS systems they integrate with that really are specialized toward code deploys.

Another great question. I'm not sure I can answer in as much detail as you've laid it out, but we haven't really found anyone, especially among our enterprise customers and the people out in the community using Juju now, who has hit that problem. Once they have their infrastructure deployed, whatever it is, whether it's a big data infrastructure, a container infrastructure like Kubernetes, or an OpenStack, they don't find that they then need to start integrating with other PaaS tools. They find themselves instead getting their vendors to start putting things into charms, because past the primitives of deploy, connect, and scale, Juju also lets you do administrative-level tasks. As an example:
How do you run a Terasort against a big data cluster? Juju has primitives that let you say: here's how I run a Terasort, or here's how I do administrative tasks, and it's modeled in that same repeatable fashion. Here's how I manage backups and restores and failovers, and all of that lives at the charm level. So charms do more than the initial setup and deploy; they also cover how, as an admin, I manage this over time.

Last question, a very simple example. Suppose you use Zeppelin to launch into a Spark shell. Would it be natural to automate, for example, doing spark-submits with Juju, so you could automate deploying your application through Juju?

Our display dongle has failed us, but we're going to switch over to a different laptop. Okay, I'm sorry.

The question was: I can tell Juju to bring up the infrastructure, but is it a natural or unnatural thing to also use Juju to automate, for example, the spark-submit, submitting your jobs?

Ah, fantastic. Yes. In the Spark charm that we've created, we have the ability to run certain actions that will, under the covers, call spark-submit. For example, one of the big defaults among the Spark example jobs is SparkPi, right? So we have an action where you can say `juju action do` (or `juju action run`) against Spark, point it at a jar that you want to run, and it will put in the right spark-submit information. The right spark-submit information being: if I'm connected to a YARN cluster, I probably want it to run in yarn-client mode and let the YARN resources handle it, and that's a flag on the spark-submit command.
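That flag selection can be sketched as a small shell function; this is purely illustrative (the function name and the exact flag choices are assumptions, not the charm's actual code), but it shows the idea of deriving the spark-submit line from the deployment model:

```shell
# Build a spark-submit line from the model: if the deployment is
# related to YARN, use yarn-client mode; always apply the model's
# configured core cap. Flag names are illustrative.
build_submit() {
  mode=$1 cores=$2 jar=$3
  flags="--master $mode"
  [ "$mode" = "yarn" ] && flags="$flags --deploy-mode client"
  flags="$flags --total-executor-cores $cores"
  echo "spark-submit $flags $jar"
}

build_submit yarn 10 myjob.jar
```

The user never writes those flags by hand; the charm derives them from how the model is wired together, which is the point being made here.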
You don't have to know that. You just say "spark submit," and based on the deployment model, whatever mode you're in, we will submit your job appropriately and stick in the right flags, all the way down to the number of cores. You can set things on Spark like how many executors and how many workers. If you've configured your model to say, even if I'm on a hundred-core machine, I only want Spark to execute on ten cores, we will honor your configuration, and when you run a job through Juju, it will set the right flags to make sure the submission adheres to your model.

Yeah, and then that becomes a big issue, or it can: where do you put those jars? If you're already sitting on top of HDFS, that's a nice location, because whatever other service endpoints you might have, you may want those jars available to particular Spark slaves or something like that. So let's put them in HDFS. We do have capabilities to put jars into HDFS, and we have the option of setting Spark classpaths and things like that, so that workers will know where they are. It helps to have a distributed file system to ease the problem of "how do I put this jar somewhere everybody can get to it?" But yeah.

Does this approach play well with something like Cloudera Manager?

Yeah, so that's very interesting. When I talked about the pluggable model, there's another side to it. There are two kinds of people we've discussed: one is the application writer who wants to write the new Hive or the new Spark; the other is the user who doesn't want to deploy Hadoop or package it or support it.
They just want to use it. And then the third class of people are the ones on that side of the diagram, with the diamond, who say: hmm, I want to make sure that I run well with Cloudera; I want to make sure that I run well with Hortonworks, IBM, MapR, whoever. By having this plug-in sort of interface, you can swap the core Hadoop out for something else and still have your services on the other end. I know this is a roundabout way to get to your question, but we do fully envision that if vendors supply charms providing a Hadoop core, we will plug in either side of the model and make it as decoupled and easy to plug services into as possible. Specifically on Cloudera Manager: that's the service where you install a management console (or was it rebranded Cloudera Director, or is that a similar thing?) and then you can install other services, like Cloudera's Hive and whatever else, into your deployment. We've actually started work on charming Cloudera Manager, so that you could deploy it and then let Cloudera Manager take over and do the actual service deployment from within it if you wanted to. We haven't finished that, but just as I mentioned before, we're interested in what community folks are looking for; we have heard Cloudera Manager requests in the past and have started charming it and working out what it looks like in a Juju environment. It's still in the works, if you will.

A little comment on Cloudera Manager: unfortunately, I'm intimately familiar with it, and Cloudera Manager is not compatible with the standard Linux lifecycles of daemons. You have to write these stupid little Python wrappers in order to actually control the stuff you deploy there. And it's non-free; I'm not even getting into that. So it's technically inferior.

All right, real quick, I want to announce something before we wrap up.
First of all, thanks for coming. We had lots of really great discussions. Was this valuable to you guys or not? Okay, nobody hates us. Good. All right. So we have this thing at developer.juju.solutions; that's the website, right here. We want you to play with the stuff that we showed you here. You go there, you fill out what you want to use Juju for, whether it's a little research project or you just want to kick the tires, and we give you up to ten nodes. By default we give you something like ten m3.xlarges, and we auto-reap them in 24 hours, so you can mess around and do all that kind of stuff. But if you want to do something bigger or more serious than that, talk to us, because we want to reduce the friction it takes for you to get your stuff done. If you want to run a benchmark on a certain piece of infrastructure, or do anything that would be interesting, please talk to us. We're dying to write up your story and tell the internet how we helped you do something incredible. So again, sorry about that, the URL is developer.juju.solutions.

And where can we find you, the big data team, specifically?

Yeah, so we hang out on Freenode a lot. There's a generic #juju channel, so if anybody's on IRC on Freenode, just pop in and say, hey, I want to talk about Cloudera Manager, for example, or whatever service we may be missing, and we can fork off into a channel; we're happy to chat with you there. We also have this blog up; it's our blog of what's going on, and we'll have a write-up of what we've learned at SCALE so far. There's also a getting-started section that will show you the repository of all of our charms and all of our bundles: how you use them, how you deploy them, things like that.
Our blog is bigdata.juju.solutions; the free AWS credits are at developer.juju.solutions. The slides have all this information in them, and they're on the SCALE website, or they will be, so don't feel like you have to take pictures or remember things. Finally, if you like Juju and you say, hey, I want to give this a try, there's our getting-started page; again, it's in the slides, so you're welcome to pull those down. You type in a couple of commands, you'll have the Juju client, and then you can deploy a syslog analytics bundle. I'm sorry I didn't get to show the AWS side; we had a foobar and now we're out of time anyway, but I'd be happy to show it off. Believe me, it works in the clouds, and you don't have to rewrite the charms for the cloud: it's the exact same bundle on any substrate we deploy to.

We also have some benchmarks. This guy here did a lot of work on benchmarking, so we have this concept of: let me run a Terasort with 10 node managers, how does that run? Let me run it again with 100 node managers, how does that run? Because we can scale relatively easily, we scale up, run another benchmark, check the status. Or: I'm in AWS and, man, it's expensive; let me try GCE. Oh no, that's more expensive. You can move around clouds, go back and forth between local and cloud environments, and see what that looks like from a benchmark perspective. All of our bundles have the capacity to have a benchmark charm attached, and that gives you free benchmarking of those workloads.

He said yes, we are out of time. I don't know if there's a new speaker coming in. Thank you so much for the great questions. This was awesome; we really appreciate you, and the slides will be online shortly.
Thank you so much.

Hello, everyone, and welcome.
Today's presentation is about Flipboard, a social aggregation mobile app that aims to transform how people discover, view, and share content by combining the beauty of print with the power of social media. This session will give an overview of the data lifecycle. Ashish and Rob will discuss the data strategy and architecture, the power of Python and in-memory processing, and the role of Qubole, Flipboard's Python SDK, and other cloud technologies in quickly and easily accessing data, analyzing it, and feeding it into other models. Today's first speaker is Ashish.

Thank you. Hello, folks. My name is Ashish; I'm the CEO and co-founder of a company called Qubole. Today's talk is going to be divided into two parts. The first part, which I'll be presenting, will talk generally about the big data landscape and some of the technologies and open source projects that have emerged there, and I'll also focus a little bit on the issues organizations face when running at a very, very large scale, and especially how the cloud helps with that. After that I'll pass it over to Rob, who will talk in depth about all the goodness around Python and in-memory processing at Flipboard, which is a very strong use case of how all these technologies are used in real-world environments.

To start with a little bit about myself: my name, as I mentioned, is Ashish. I'm the CEO and co-founder of Qubole; essentially, we offer a cloud-based platform for big data. That's what Qubole does. Before Qubole, I was at Facebook, where I was instrumental in building out the data infrastructure; I led that team for a number of years and, in the process, also created Apache Hive.
That was an invention of mine and my co-founder's. So, having been in this environment back when big data wasn't even called big data, I've seen this industry grow and seen the advances and changes that have happened in this area.

So, just a little fact check: why has big data become so important recently? A lot of this has to do with changes in the data landscape itself. If you go back to the 90s, a lot of the data that was produced came from business applications. This is what I would call transactional data: data sitting in a database, created as a result of filling out forms or performing transactions, with banking transactions being the canonical use case on which the whole RDBMS industry was built. So a lot of that was transactional data. But as things started moving online, as the web and web applications and mobile applications expanded, we saw a shift in the nature of the data itself. What has become more important over the last 15 to 20 years, the last decade and a half, is what I call interaction data. This is data generated by entities interacting, whether those entities are human beings or machines. Inherently, this interaction data is a lot more unstructured, and it has far more volume and far more velocity compared to the previous generation of data sets, which were mostly transactional. The previous-generation systems were found wanting in their ability to handle these data sets and to provide tools for processing these types of data, and that is really what led to the evolution of big data. Now, having seen data infrastructures run and operated for this type of data, of course, the most logical thing that people
realize is: yeah, you need some sort of infrastructure that scales with commodity nodes, because these are really, really large data sets, so you have to think about horizontal scalability rather than vertical scalability. But one thing that has also become important is that as these data sets have grown larger and larger, and as more data has become available, the number of use cases where data is applied has also mushroomed. So apart from scalability on the infrastructure side, a modern-day data infrastructure also needs scalability in the types of users and types of use cases it can support. It's not just your traditional SQL data warehousing use cases; it's also machine learning use cases. Streaming analytics is becoming more and more important as well, especially with the emergence of the whole IoT industry, whether that's consumer wearables or the manufacturing sector. And, of course, data preparation as well. All of these use cases have mushroomed; they have become more complex and more advanced. As a result, a modern-day data infrastructure not only needs to support horizontal scalability to deal with the volumes of data, it also needs to support multiple interfaces and multiple different user personas doing different types of analysis.

If you dissect that further, if you dissect the data infrastructure of a modern-day company, you can categorize systems into various categories, and those categories essentially follow the lifecycle of the data. Data emerges; it's created in apps; it then moves through infrastructure that collects it.
That is what is usually called data ingestion. From ingestion, the data then goes into different systems: some used for ETL, some used for ad hoc analysis by analysts, some used for machine learning and deep learning by data scientists, and some used by developers for building applications. Finally, the payloads or summaries arising from these analyses are used either to drive applications or to drive dashboards and visualizations. The consumers of dashboards and visualizations are end users who take actions on the basis of those insights, and the applications could be anything: consumer-facing applications, applications driving certain optimizations, and so on.

That is typically how the different layers of data infrastructure have to be in place to really have a platform that can subsume all these use cases and make big data simple.

What has also happened is that, because there was a void of systems able to support all these different categories at scale, a lot of innovation has happened in open source to fill each of those voids. In almost every one of these boxes there is some open source technology available today, and in certain cases multiple open source technologies, that solve that particular box's functionality. When you plug all these things together, you get a full-blown data platform that can solve the big data use cases.

On the collection side, a standard that has emerged quite rapidly in the last few years is Apache Kafka. There are other technologies also available in the cloud, for example Kinesis in AWS, but these technologies have really made it simple to collect large data sets
generated from multiple endpoints and to put them in a place where they can then be analyzed.

On the processing side, streaming analytics has gone through multiple generations. It started off with Storm, one of the technologies under the Hadoop umbrella that was used a lot for streaming analysis. Spark has recently emerged as a technology solving this use case, especially because of its strong roots in in-memory processing, which is something you need for streaming analysis. In the classical batch processing area, Hadoop has been there for a long time, and Hive attacks that use case quite heavily for doing large-scale data processing in a batch-oriented manner. For machine learning, again, Spark has emerged as a strong contender, and a few other projects like Flink have also emerged to attack that use case; there used to be a project called Mahout under the Hadoop ecosystem that attacked it as well. And for ad hoc analysis you're seeing the emergence of technologies like Presto, and Cloudera has Impala, and so on, which are trying to attack those use cases.

Net-net, in the last seven or eight years a lot of innovation has happened in the open source stack, so that you now have a lot of options available to plug each of these boxes into a comprehensive data infrastructure that you might put in place for a company. At Qubole we see a lot of these use cases; we see a microcosm of how different use cases are using different technologies, and what their relative merits and demerits are.
A modern data team essentially needs all of these things in place. It's not just one piece of technology that cuts across the different use cases; you really need all of these technologies in place to serve your analysts, your data scientists, your developers, and even your line-of-business users.

So that's all great: there's a lot of open source technology available, and templates and architectures have emerged. But a lot of companies still find it extremely difficult to put this stack together, to put together a big data infrastructure. In many companies you'll see a silo of users using big data: it could be a developer team, or it could be a data science team. But it becomes very difficult for a company to really raise the bar and create a central platform that all of these personas can use for accessing data and doing data processing at scale, both in terms of the volume of data and in terms of the breadth of use cases.

And that is where, in my view, the cloud plays a very massive role. One reason companies aren't able to achieve a true vision of a central data platform that solves all these problems is that when you build this infrastructure on-prem, it is inherently static. There is a fixed capacity you have to deal with, and the result is that the central team managing that infrastructure always tries to lock it down, because the capacity is not elastic. Because the cloud is elastic, it gives you on-demand infrastructure, and it gives that central team a lot of flexibility to open up the infrastructure to multiple different use cases. So the cloud plays a very fundamental role there
because of its elasticity and its flexibility. It also plays a fundamental role in enabling self-service: a lot of technologies in the cloud are geared towards building things that are self-service, and if a company wants to move to a vision of a self-service data infrastructure, the cloud is central to that.

Now, putting this stack together. My message here is that in order to really be successful with big data, the cloud is a very strong substrate to build your data infrastructure on top of. It allows you to open up data to a lot of users while keeping the operational overhead extremely low. However, the way these technologies are built for an on-prem data center versus the cloud is fundamentally very different, and I'll highlight a few differences to give you a flavor of what it means to build this for the cloud.

The very first difference is storage. A lot has gone into the Hadoop ecosystem to make HDFS the de facto storage for big data on on-prem clusters. In the cloud, the storage mechanism is completely different: object stores are the right place and the right technology to store this data, because they do a couple of things. Number one, they make things extremely elastic: because they are super scalable, you can store small data as well as large data and adjust as you go. Number two, by separating the abstraction between compute and storage, they make compute elastic as well. You don't need infrastructure to be running all the time;
you only need it when you need processing to happen. So object stores are a fundamental building block of the cloud, and if big data is to be done properly in the cloud, they should be heavily leveraged, because they give fundamental advantages in operational efficiency as well as in elasticity.

Along with the object store, the other fundamental building block of the cloud is elastic compute. I think the cloud, for the first time, gives you the ability to fit the infrastructure to the application. We've all been taught to set up an infrastructure and then try to make the applications fit into it: I have to optimize this job, I have to change that, because my machines don't have enough memory, and so on. With elastic compute and the flexibility it offers, you can change the infrastructure on the fly to fit the application. This becomes extremely handy in big data use cases, which can be extremely bursty and which span a wide spectrum: you might need very high-memory machines for machine learning use cases and much lower-memory machines for ETL use cases. That flexibility is very important when you're trying to do big data in the cloud, and it's fundamentally different from what you get on-prem with fixed boxes that have a certain memory, CPU, or disk profile.

When you combine an object store and elastic compute, you can get to a platform that makes big data self-service, completely on-demand, and that also future-proofs you against any evolution in those projects or in your use cases, because of the fundamental flexibility of the cloud.

Now, one thing that is
constantly brought up as a counterpoint is: hey, how secure is my data in the cloud? That picture is also changing very rapidly. AWS has a lot of product features built around encryption, and so do the other clouds, and there are a lot of product features built around compliance. Those are just some of the features that have been implemented in the cloud to address security. But fundamentally, when I talk about this, I always give people the example of whether they think their data is safer in their own data center or in a cloud data center. The analogy to draw is: do you store your money in a safe in your home, or do you put it in a safe at the bank? Some of the best practices around security are undertaken by the cloud providers, and as a result I feel the security perception of the cloud is changing very rapidly.

So the cloud becomes an even stronger option for putting together a unified data platform that has all these different engines plugged in for various use cases, all provided within a secure environment. I think that is what most modern data teams should aspire to and run towards in order to make big data successful for the enterprise.

I'll pause there and pass it over to Rob, who will talk about Flipboard and how they're using big data at a much more detailed level. This has been a high-level picture, but he'll go into lots more detail around usage and real-world use cases.

Okay, thank you. Can everyone hear me okay? Some really good points there. See, now people aren't just generating transactional data anymore; we're generating interaction data. If you remember, not long ago no one was using phones, and now everyone has a device.
So we have insane amounts of data going through our platforms, and we need big data tools to analyze those large data sets.

Okay, I'm going to talk about how we use big data at Flipboard. First I'll give an introduction to Flipboard, what it's about, and how we envision the user experience, and then I'll talk about some of the techniques and products we use, including a bit about how we use Qubole and a few other tools we have available to us.

So, a little about Flipboard. Flipboard is a news aggregation and news recommendation platform. We think of Flipboard as a place for your interests. This is the first-launch experience: you open Flipboard for the first time and you're prompted with a kind of questionnaire asking what topics you're interested in. This person chose news, crafting, and entrepreneurship. What we aim to do is take your usage data plus the information from this questionnaire and pipe really high-quality articles to you, so you have a really great experience reading nice long-form content.

There's another component of Flipboard, which is collection. You can collect things on Flipboard: you can add news articles to what we call magazines, so you can create your own magazine, your friends can subscribe to it, and they can read the things you're reading. You can have your own point of view on coffee, or surfing, or mountain biking, and other users can consume that. It's kind of a social ecosystem around content.

Basically, whenever a user uses the application on their phone or on the web, we take into account their interactions with particular
entities within Flipboard. So we'll look at a user, and let's say this user reads ten articles. What we try to do is take the usage data they've provided and come up with a good strategy so that the next time they enter the application, they're presented with even better material. The idea is that Flipboard learns your preferences as you use it more and more.

Right now we have about 80 million active readers.

As for the data infrastructure, part of my job as a data scientist at Flipboard is working on recommendations of these particular entities: magazines, articles, topics. There are two different sides to the data pipeline. One side is the ETL process: working with Hive and getting all the data into our SQL data store, which is Amazon Redshift. That serves as our analytics platform and feeds a lot of our tools for analyzing A/B tests, for looking at, say, our MAU number over the course of the last year, and for generally serving analytics. There are a few recommendation products coming out of the Redshift data store, but most of our recommendations actually come from our recommender index: data is pulled from S3 and moved into a graph database, and this graph database produces recommendations. So that's a whole other topology, a separate pipeline.

Okay, so we have ETL and reporting. And here are some of the tools. Is everyone mostly a Python person here, or any Java people? A few, okay. We use Spark and Spark Streaming. Actually, we're not using Spark Streaming for production-ready tools yet, but it's something we're starting to implement, and Spark is actually kind of new to Flipboard.
We haven't been using it long. We have been using Hadoop and Hive for a long time. And SciPy and NumPy: these are a lot of what I work with as a data scientist. They're generally in-memory, and we do a lot of our prototyping in memory, you could say.

Okay, so people ask: why use Python for data? Python has a lot of really great libraries, there's a really great community around it, and it lends itself well to data analysis because the scientific community uses Python a lot. What's nice is that Python is a language that an analyst, someone who may be an expert at using Excel or SQL, can pick up fairly easily, and I can show that with this example.

I'm going to talk about the word count example, which is just a MapReduce job for counting the words in a corpus of documents. You can see here we have just two programs: a mapper and a reducer. The mapper reads a file line by line, splits each line, and emits tuples of a word and a one; the reducer combines these together and groups them. What's nice is that you don't have to know anything about inheritance; you don't even really have to be a programmer with a degree in computer science. You can pick up Python, and as long as you know the inputs and the outputs and what you want as a result, the rest comes a lot easier than with a compiled language. You understand standard in and standard out, you understand looping line by line, and you just need to know how to run this and where to run it from.

This is word count in Java. I'm sorry,
this is kind of small, but you can see there's a lot here: you have to understand classes, the concept of inheritance, and a lot of other constructs in the language that may not be as easy to grasp for an analyst as for a programmer, someone who went to school and learned C++ or Java. I think most universities now require Java as the compiled language you learn; I'm not sure why they chose it, but it's become kind of an industry standard that everyone learns Java.

So for word count in Java you need to know about objects, classes, and inheritance; there's a context variable; there's a lot of boilerplate code in this example; there's a lot to import. It just makes the barrier that much higher for analysts or people who are new, including math people without programming backgrounds who would otherwise need to learn all of this. We just want them to hit the ground running and be able to start doing data analysis with Python.

That said, there are a few advantages to Java: it's faster, and it's more modular.
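To make the Python side concrete: the two-program streaming job described a moment ago, a mapper emitting (word, 1) pairs and a reducer summing them, can be sketched roughly like this. In a real Hadoop Streaming job the mapper and reducer are two separate scripts wired together by the streaming jar; chaining them in-process here is purely for illustration, and the file and path names in the comment are assumptions.

```python
import sys

def mapper(stream):
    """Map step: read lines, emit a (word, 1) pair per word.
    A real Streaming mapper would print "word\t1" lines to stdout."""
    for line in stream:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce step: sum the counts per word.
    Hadoop Streaming hands the reducer the mapper output sorted by key."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

# A Streaming job would run these as two scripts over stdin/stdout, e.g.:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /corpus -output /counts
if __name__ == "__main__":
    for word, count in sorted(reducer(mapper(["the quick brown fox",
                                              "the lazy dog"])).items()):
        sys.stdout.write(f"{word}\t{count}\n")
```

The point of the slide stands either way: the whole job is two short functions over stdin/stdout, with no classes or inheritance required.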
I know that YouTube was using Python for a while and is now moving toward Go and Java, because they're packaged more easily and are easier to ship around. And like I said, more programmers know Java than most other languages because of what's required in school now.

Okay, so this is the part where I talk about our data stores: the pipelines running from S3, where the sources and destinations are, how data gets from point A to point B, and all the transformations we make along the way.

Usage is by far the most important component of Flipboard's data, because without usage data we don't really have a product. If an hour or two goes by and we don't get usage data, that's a serious problem; that pipeline has to run every hour. All of our usage feeds into all of our products, so we basically can't have products without usage data.

We have two different pipelines. One is the Redshift pipeline, the analytics platform I talked about earlier. A lot of this data lives in S3 in a JSON-serialized format; we do some transformations on it, turn it into a columnar format, and insert it into Redshift, and a lot of our products pull from Redshift. The other one is the recommender index.
That is a completely separate data store and a completely separate pipeline, though it still reads from the same location, S3. We have a collector that collects all the data, transforms it, and throws it into Kafka. The recommender index, our graph database, is what houses all of our recommendations, our latent factor models, and basically all the tools we have for providing recommendations to users. That server reads off of Kafka and updates in near real time, so if we have new interactions from a user, within an hour or two we're able to provide recommendations based on that new knowledge.

A little about the recommender index: it's in memory, it's the graph database, and it's one giant box, about 250 gigs I think. We run all of our optimizations in memory because it's very fast. A lot of what we do on the recommender index has to do with clustering; we have a few collaborative filtering models running on there, and we have every user's latent factors, their usage condensed into factor models, which give us a good means of providing recommendations.

Now, Redshift. This is our ETL setup for getting data into Redshift. We use Qubole, and there are just two steps here: you have a Qubole configure call, and you create a Hive command with HiveCommand.create. Ashish is a better person to ask if you have any questions on that. This Python ETL setup is basically what executes whenever we want to run a job on Hive: you can see we have a few attempts that run, we have a timer, and we have a HiveCommand.run that runs the query passed in to this execute function. Sorry, this is really small. This is part of the Redshift pipeline.
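The actual wrapper on the slide is Flipboard-internal, but the pattern being described, a few attempts, a timer, and a HiveCommand.run on the query, amounts to a retry loop around the Qubole SDK call. A minimal sketch, with the SDK call stood in by a plain callable (an assumption for illustration, not the real wrapper):

```python
import time

def execute_with_retries(run_query, query, attempts=3, wait_seconds=60):
    """Run `query` with a fixed number of attempts and a delay between
    failures -- the "few attempts and a timer" pattern from the talk.
    `run_query` stands in for something like qds-sdk's HiveCommand.run."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query(query)
        except Exception as err:
            last_error = err
            if attempt < attempts:
                time.sleep(wait_seconds)  # the "timer" between attempts
    raise RuntimeError(f"query failed after {attempts} attempts") from last_error
```

With a wrapper like this, every scheduled Hive job gets the same failure handling, which matters when the usage pipeline has to land every hour.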
This is just a general execution of a query: whatever query we have, if we want to run it on Hive, this is what runs it.

Now I'll talk about the process of getting our usage data into Amazon Redshift. This is the usage data pipeline. Data comes in through S3 as rows of JSON; it's deserialized in a Hive job, and some other operations happen on the data, like IP-to-geolocation user-defined functions. Once the transformations happen, it's put into a columnar format and stored back into S3, and then a process takes it from S3 and puts it straight into Redshift. From Redshift, a lot of our analytics platforms consume the data store, as well as ad hoc queries and the like. This next slide is just the process that moves it from S3 to Redshift.

Now I'm going to talk about instance management. Previously we had a bunch of Amazon reserved instances in a VPC, and we managed them ourselves. We have about three people on the data infrastructure team now, and with a fixed-size VPC you have to manage everything yourself; requirements change all the time, and it gets unmanageable and unwieldy. So it really helps us to have some type of auto-scaling. With the cloud, we spend a lot less money on reserved instances, and if we're not running a job on an instance, we don't have to pay for it, which is really nice. A lot of these jobs run on Qubole, and as far as configuring, tearing down, and bringing up instances, Qubole does that on the fly, so we don't have to manage it either. So that's
that's also really nice. And I mentioned auto-scaling: we have a dynamic data infrastructure, with all the jobs running on the same cluster. If, say, two jobs are running at one time and a new job comes in, the auto-scaling deals with the demand of the new job and comes up with its own strategy to handle running them all at the same time. With Qubole, the instances are hosted by Qubole but owned by us: whenever we run a new job and need more machines, Qubole provisions those machines on our behalf and then decommissions them as the job finishes. That has saved us a lot of money compared to the past, and it's something I'm really happy to have.

Okay, so this is one application we use Spark for. We have a follower-suggestion recommendation within Flipboard: as you're going down your top stories, your main feed of articles, a recommendation module pops up, basically a user recommendation module, like "people you may know" on Flipboard. This uses a Spark graph. What we were doing in the past was computing follower suggestions using Hadoop, and that took somewhere around 40 hours to complete, and sometimes it wouldn't complete at all. A lot of the reason is that MapReduce is not in-memory, so it takes a really long time to get through a clustering algorithm. So we moved everything to Spark, and what's nice about Spark is that everything's in memory.
So these connections, from user A to user B, are basically treated individually rather than living on different partitions. Say you have a node for a user: if that node is referenced many, many times, it's able to sit in memory so you can access it easily.

This slide compares how long the jobs took: the Hadoop job took 40-plus hours, while the Spark version took 20 to 30 minutes. Spark treats vertices individually, uses caching, and holds on to the most-visited ones. As you can imagine, if a user is very popular on a social platform, that user's node is going to be accessed more than any other, and it's better to have these nodes decoupled so they're not living on one particular partition on disk that you have to revisit over and over. That makes the computation a lot, lot faster.

To answer the question: I'm talking about users, like a social media connection. There are nodes and edges: if I'm following you on social media, I would be a node and there would be a directed edge from me to you.

Okay, I'll talk a little bit about Spark Streaming, and then I'll do a quick demo at the end. The demo is basically a topic cloud of user affinities: a user has affinities to particular topics, and the more affinity a user has to a topic, the larger the bubble will be. I'll show that at the end. So this is basically an interest graph visualization.
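Flipboard's actual job runs on Spark's graph processing; purely as a toy illustration of the follower-suggestion idea over those directed edges, here is one common scoring scheme, suggest the users followed by the people you already follow, which is an assumption for illustration, not Flipboard's actual algorithm:

```python
from collections import Counter

def suggest_follows(follows, user, top_n=3):
    """Toy "people you may know": rank users followed by the people
    `user` follows, weighted by how many of those connections follow them.
    `follows` maps each user to the set of users they follow, i.e. a
    directed edge from follower to followee, as described in the talk."""
    already = follows.get(user, set())
    scores = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in already:
                scores[candidate] += 1
    return [u for u, _ in scores.most_common(top_n)]
```

On a graph with tens of millions of users this inner loop is exactly the part that benefits from keeping hot, frequently referenced nodes cached in memory rather than re-reading them from disk partitions.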
Our interest graph pretty much feeds into all of our products: each user has a topical affinity vector, and we use that vector to feed you articles, topic recommendations, and magazine recommendations, and it serves our ad platform as well.

Okay, so this is a real-time demo. It uses Spark Streaming: it listens to Kafka and reads events, read events for articles. If a user reads an article, we look at what topic the article is in and then adjust the size of that user's affinity for that topic. There are a few steps here: it creates a streaming context and reads an input stream from Kafka, and based on the user's time within an article, it applies more weight to that article. Say a user reads an article for ten seconds: that gets a larger weight than a user who goes into the article and then jumps right out. The reason we chose this is that users who click into an article and jump right out may be signaling clickbait, or some type of content the user is not really satisfied with. When the user jumps into an article and reads it for a while, we assume they have a higher affinity to that topic and we apply a larger weight.

I'll show you this tool. Let's see if I can get it to show up here. These are the Flipboard topics for this particular user: you can see they read more articles about Google, machine learning, and psychology, and from these sources, Huffington Post and Business Insider. This is reading from Kafka, and for each user interaction with an article, a certain weight is applied to a particular topic based on the article's topic. A few other users here have different topical affinities; these users' topical affinities are more general. Yes?
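The dwell-time weighting just described can be sketched like this; the ten-second threshold and the weight values are illustrative assumptions, not Flipboard's actual numbers:

```python
def dwell_weight(seconds_in_article, quick_bounce=10, bounce_weight=0.2,
                 engaged_weight=3.0):
    """Weight a read event by dwell time: a quick in-and-out (a possible
    clickbait signal) gets a small weight, a sustained read a larger one.
    Threshold and weights are made up for illustration."""
    if seconds_in_article >= quick_bounce:
        return engaged_weight
    return bounce_weight

def update_affinities(affinities, topic, seconds_in_article):
    """Bump the user's affinity for the article's topic by the dwell weight."""
    affinities[topic] = affinities.get(topic, 0.0) + dwell_weight(seconds_in_article)
    return affinities
```

In the streaming job, a function like `update_affinities` would be applied to each (user, article, dwell-time) event coming off the Kafka stream, so a sustained read moves the topic bubble much more than a bounce does.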
Oh, no, this is something that was developed internally. I think it's just reading off of Redis. This basically shows you what the person is interested in; it's an interest graph for one particular user. You can see here they read a lot of space, photography, Apple news, and Facebook articles. It gives us a good idea of exactly what users are interested in, and what a user might be inclined to buy if we wanted to show them an ad: maybe they're into photography and we'd show them an ad for a camera. The more we know about users' topical affinities and interests, the better job we can do providing articles and content they're interested in.

And that concludes my talk. Do you guys have any questions?

So, the devices send to one of our servers. We have what we call a flap server; this is the server infrastructure that talks directly with the phones and all of our devices. The data is sent to that service, the service puts it into JSON format and throws it into S3, and from there we consume it. It's not really real time, but it's basically chronologically ordered. Whatever we have in, say, the recommender index is all the data and all the usage for a particular group of users up to, let's say, an hour ago.
Whenever we get new data, it's put into Kafka, and the last hour of usage gets put into our recommender index. Users don't generally come back every hour, though we'd like them to, but the point is that you have data up to a certain point and we just keep feeding in the most recent. As a rule of thumb, we don't try to analyze everything: if you've been a Flipboard user for the last four years, we don't analyze four years of your interests and every article you've ever read. We look at the last few months, maybe six months, of interaction. One, because your interests change over time, and two, because then we analyze less data. So we use a sliding window.

Yes? The question is: you're still doing MapReduce with Hive on Hadoop boxes, and in this case also Spark, so why Kafka? Are you trying to move to Kafka, or is there something Kafka does in addition? We just use Kafka as a queue. We want to read the most recent usage data off that queue and update our recommender index and all of our stores with it, so the most recent usage data lives on there. We also use SQS; we're trying to move over to Kafka, but on our SQS queue I think we're throwing article content, and that's also consumed by the recommender index. The recommender index also has our topics: for each article we can get a ranking of topics, so we have a better idea of exactly what the article is about, and we can send it to the right people who are interested in reading it.

Yes: more details on why we're moving from SQS to Kafka?
Well, SQS is really expensive. Mostly, yeah — I'm not sure of all the reasons, but from what I understand, it's because we already have Kafka, and if we're going through the trouble of maintaining and using it, we might as well use it for one extra purpose instead of paying Amazon more money. We probably spend about a million dollars a month on AWS, so it's a lot of money; we don't want to pay more for something we already have the means to handle with Kafka. And we really don't have that many people on the data infrastructure side, so it helps to have a few chosen tools — favorite tools, if you will.

[Second speaker] I think also, at a larger scale — it's the same reason people move from, say, MPP databases toward Hadoop and Hive. As your scale increases, SQS-style technologies become less scalable compared to what Kafka and the like offer. I don't know their specific reason, but what I have seen is that, generally, as scale increases you need to start moving toward technologies which cater to higher-volume data. Of course, there's also stronger integration with open source projects, which helps. So I think it's a combination of both, but primarily — think of these all as message buses — moving from an enterprise message bus to a cloud-scale or web-scale message bus, which is where the Kafkas of the world reside, usually happens because of volume.

Yeah, good point. More questions?

Yeah, that's a really good question — there are a lot of parts there. Oh, I'm sorry, let me repeat the question.
So she wants to know how we use topical affinity to recommend content to users, and she's also asking — there must be more to it than just looking at the topical content and then giving recommendations, right? Kind of. There are basically two components. At a high level, for recommender systems there are usually two types of approaches — let's just say in news. There's the topical component, which is content filtering, and then there's collaborative filtering.

Content filtering has to do with the content of the sample. Say I want to recommend a news article to someone, and the article is about big data and machine learning and Elon Musk. We can take the article, extract topics from it, and create what's called a topic vector for that document, and use that information whenever we want to generate recommendations. So imagine you have a topic vector of size 40,000, and a user's topic vector — which represents their topical affinity across all different topics — is the same size. You can just perform a dot product between those two vectors and come up with a score. For a particular user, you match their vector against the last 50,000 or 100,000 documents that came through the pipe recently, and then you sort by score. That's just the content part.

The collaborative filtering part is a little more advanced and a little more powerful, in that it looks at the users that are similar to you based on your usage history and creates clusters of users around the content and the usage. To facilitate this we use an alternating least squares algorithm, which is an optimization algorithm, also known as a latent factor model. What a latent factor model is, is a model which can take a bunch of user data,
which is, let's say — I don't want to say a matrix, but you have a lot of users and a lot of documents, and each user has a particular score for each of those documents. After running this optimization, you get a bunch of factors for each user, and factors for the documents, and what these factors represent are the interactions of the user with the content. Then, the same way as with content filtering, you match these factor vectors — one representing the user, one the document — go through each of the documents, come up with a score, and rank them. We merge both of these models together to come up with our recommender system. The users, the pipelining, and everything kind of lives on this graph database I talked about earlier, and that's where all the optimizations happen and where all the factor models are computed. Does that answer your question?

Oh, the algorithm — I can't tell you exactly which implementation it is, but it's an alternating least squares algorithm. You're welcome.

Yes — so our graph database is internally developed; we call it Podex. It was written in C++ by one of our developers, and we have Python bindings written with Boost, so basically you can act on this database almost natively in Python, but under the hood it's all running C++. It's really fast. It's just one giant behemoth box — all the data lives in this box, with a large amount of memory on that one instance. And that's pretty much all we need for doing this.
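The two approaches outlined above can be sketched together in a few lines of NumPy — a toy illustration of the textbook techniques, not Podex or Flipboard's pipeline; the sizes are tiny where the real topic vectors run to ~40,000 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_docs, n_topics, k, lam = 8, 6, 10, 3, 0.1

# --- Content filtering: dot product of topic vectors, then sort by score ---
user_topics = rng.random(n_topics)            # one user's topical affinity
doc_topics = rng.random((n_docs, n_topics))   # topic vector per document
content_scores = doc_topics @ user_topics

# --- Collaborative filtering: alternating least squares (latent factors) ---
R = rng.random((n_users, n_docs))             # observed user-document scores
U = rng.random((n_users, k))                  # user factor vectors
V = rng.random((n_docs, k))                   # document factor vectors
for _ in range(20):
    # With V fixed, solving for U is ridge regression, and vice versa.
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))
collab_scores = U[0] @ V.T                    # predicted scores for user 0

# --- Merge both models and rank (here: a simple sum) ---
ranking = np.argsort(content_scores + collab_scores)[::-1]
print(ranking)                                 # best documents first
```

The merge step here is a plain sum; how the two scores are actually weighted and combined is not something the talk specifies.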
Yeah — well, this was actually before Neo4j, before a lot of these things. I'm not sure what year it was written, but it was before all these different tools. I guess I can explain how it's used: I create a database instance, a database object — so I have this db object, and I can just type db.user and then brackets, like a dictionary, and pass in a string user ID, and it comes up with that user, which is a node in the graph. Then I can say, okay, I'd like to go through all the views for this user, all the topics, or all the likes for this user — their magazines and everything live within that user's part of the graph. It's really useful for analysis, and a lot more useful for certain things, like getting the most recent, chronologically ordered documents or likes a user had. It's much better suited for recommendations than, say, SQL, because SQL is just one giant spreadsheet, almost.

That was a good question. I'm hoping he's going to open source it soon — I've been pestering him about it. Maybe it's something he developed that he wants to bring along whenever he goes to a new job, or maybe he just doesn't want to have to support it, because there's a lot of work in supporting something like this.

I think it's probably like 4,000 or 5,000. No, it's not — it's just in memory. Yeah, I haven't really dug through it yet, but it works really well for our purposes.

Oh, generally that's written in Java. And for ETL it's mostly Java — well, no, wait, for ETL, that's — I'm sorry.
That's in Python. I can't remember exactly — it really depends, I guess. But whatever I showed you there, the ETL going from S3, that's all in Python.

What types of distributed processing are we doing? We're doing Hive processing. There are a lot of jobs running — that'd probably be a better question for our data infrastructure people.

No, I think we are more of a provider, and they use our platform to do that.

How did we discover each other? I don't know, I just met you like an hour ago. But I guess there was a pain point and there's a solution to that pain point, and essentially that's how we came together. We specialize a lot more in big data on the cloud, Flipboard uses the cloud very heavily, and that's where the intersection happened.

[Second speaker] So there's a programmatic interface to Qubole, and then there's the infrastructure orchestration piece behind the scenes. What the cloud allows you to do is create infrastructure on the fly, so if you marry the programmatic interface with the orchestration piece, you get a very powerful paradigm where everything is created on demand and fits the purpose of your application.

Yeah, so this is auto scaling.
It's not just spinning things up and tearing things down manually — it's spinning things up and down based on demand. There are algorithms which figure out, okay, we probably need more machines, or we want to shrink it down. Auto scaling a stateless application is easy; in a stateful application, data is involved. If you're scaling down a Hadoop cluster, for example, and you knock off a couple of replicas — what happens to that data? So there's complexity around data placement and the like, and that is a central part of our orchestration piece. And when you're using the cloud, companies like Flipboard can write applications on top of this which do the higher-value things — creating these affinity graphs, recommendations, and so on. Behind that application there's a whole pipeline, and one part of that is solved by Qubole, and another part by the graph database and so on.

We program straight against AWS because a lot of configuration management software is general purpose, and when you're doing something very specific to big data, you need access to the daemons — the state of the daemons — because those are the inputs that go into the prediction algorithm that figures out how the scaling has to be done. So we've kind of built that out ourselves.

[Flipboard speaker, on a question about Caffe] So yeah, you can use the Caffe library for neural networks, but in what context do you mean — in the cloud? I built a Caffe box with CUDA and I had to build everything myself. Are you guys looking to go more in that area?
[Second speaker] We do support integration with R, but we have not looked at Caffe as such. Now, with the platform, you have to draw a box around what is supportable by the platform and what is not. We essentially draw that box around Hadoop, Spark, Hive, and Presto — those are the four technologies. You can use Caffe libraries, or any generic library, in Qubole, but it has to fit into the paradigm of one of these infrastructures. For example, if you want to write a Python streaming job in Hadoop or Spark using Caffe or any of these other libraries, you can certainly do that, but that goes beyond the scope of the platform itself. Any programmatic library you want to use — like MLlib; a lot of people use MLlib with Spark — you can use with Spark, and so on. So that's the programmatic interface. That programmatic interface talks to our interface, which is around job submission, execution, results, movement of data, and so forth.
That is the interface that we provide; below that is the orchestration piece. So I guess it depends on what level you enter.

[Flipboard speaker] Right — that's very focused. I'm sure you can use Caffe for many other things, but what I use it for is a deep learning network that looks at an image and classifies it into one of a thousand different categories. That is so niche right now that it's barely useful — for us it may be, and for a few other companies, like a camera company or Dropcam or something. But maybe soon adoption of deep learning will grow more and more. Still, at the level you guys are at now, you could do it. Does MLlib have any — I think MLlib has some deep learning stuff now, but I'm not sure. I don't know if anyone uses Caffe that way.

Oh yeah, you can set it up with the GPU, but you have to do all the management yourself if you're doing it straight up like that. I set it up and it took me about a day — that's a lot. And actually they killed my box because I wasn't using it, so now I have to set it up again. It happens.

What's that? Oh, I haven't used Docker yet — I haven't had the pleasure. I think you can. I used an AMI — an Amazon reserved instance — to set it all up, which makes it a little bit easier.

I think we're out of time. Yeah, okay — thank you very much.

[Next speaker] Hi, can you hear me? Okay. First of all, a few words about me: I'm Peter Czanik from Hungary, working as community manager at Balabit, syslog-ng's upstream developer.
I do packaging, support, and advocacy of syslog-ng in Hungary and around the world. If you haven't heard of Balabit, it's an IT security company headquartered in Budapest, Hungary, but it also has offices in other European countries and, since last year, in New York — Manhattan.

Today I will talk about syslog-ng, which is not your typical big data tool, but we are working on making it one, and since last year we've added quite a few possibilities for using syslog-ng in a big data environment.

First of all, what is logging, and what is syslog-ng? Logging is the recording of events on a computer; a typical log message on a Linux system looks like this one. And syslog-ng is an enhanced logging daemon with a strong focus on central collection. Well, that was the focus for quite a long time, and it still is, but we have many more features and possibilities now. It can collect not only system messages but all kinds of application data; it can process and filter these log messages; and it can not only store messages in a central location but also forward them to a wide variety of destinations — since last year, including many big data solutions. It's a kind of C-3PO: while C-3PO knows about six million forms of communication, we still have work to do to reach his abilities, but we are working on it.

So, how can you use syslog-ng in a big data environment? It can facilitate a data pipeline to big data in many ways: it can act as a collector and a data processor, and with data filtering you can make sure that only relevant messages reach your big data systems. So let's talk about these in detail.
First of all, data collection. syslog-ng can collect both system logs and application logs, which provide very interesting and important contextual data for each other. As a syslog solution it can collect messages from a wide variety of platform-specific sources, since it runs on all the different Linux distributions and most Unix variants — so it can read from /dev/log, the journal, Solaris STREAMS, and so on. As a central logging solution, it can receive messages from the network using both the original legacy syslog protocol and the new RFC 5424 syslog protocol. But not only these: you can use any data format you like to send messages to syslog-ng, as long as you can separate the messages — with a newline or a similar convention. What is also very important is that syslog-ng can collect not just system messages but any kind of application messages: it can read log files from other applications, collect data through sockets and pipes, or collect the output of an application if it's started by syslog-ng.

The next step is processing all of this data. It's very important that with syslog-ng you can process your data quite close to the source.
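On the collection side, sending your own application data is as simple as the newline-separated rule above suggests. A minimal sketch — a UDP datagram in the legacy syslog line format; the loopback address and port 514 are assumptions, and nothing listens there unless a syslog-ng UDP source is configured:

```python
import socket

# Legacy (RFC 3164-style) syslog line: <priority>timestamp host tag: message.
# Priority 13 = facility 1 (user) * 8 + severity 5 (notice).
msg = "<13>Feb 21 10:00:00 myhost myapp: user alice logged in\n"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(msg.encode(), ("127.0.0.1", 514))  # fire-and-forget datagram
sock.close()
```

Any format works the same way, as long as the receiving source knows how to split the stream into messages.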
So if you have many machines, the processing of data can be distributed across all of them, making the life of your central infrastructure easier. You can classify, normalize, and structure log messages using built-in parsers in syslog-ng. You can also rewrite messages — and I don't mean falsifying them; for example, it's often required to anonymize log messages. Messages can also be reformatted using templates: analyzers often need log messages in a specific format, like a specific date format, or the whole message in JSON, and that can be done by syslog-ng. Data can also be enriched: for example, given an IP address, location data can be added to the message using GeoIP, or — as we will see with patterndb — additional fields can be added based on the message content.

The next step is data filtering, and it has two major uses. First, filtering makes it possible to throw away surplus log messages — you don't want to store debug messages unless it's really necessary for debugging; the rest you discard. It's also used for message routing: if you have a SIEM, you route only security-related events to the SIEM, and the rest can be stored locally on the syslog-ng server. Filtering has many possibilities, based on the message parameters or — thanks to parsing — on the message content. There are many ways to find the right messages: comparisons, wildcards, regular expressions, or various functions in syslog-ng. And the best thing is that all of these can be combined with boolean operators, so the possibilities for filtering messages are practically endless.

Finally, a few words about which big data destinations are in syslog-ng right now. We
support Hadoop, some NoSQL databases like MongoDB, and Elasticsearch — actually, Elasticsearch is our most popular destination right now — and the second most popular big data destination is Kafka.

Next, I'd like to say a few words about log messages. I already showed a typical log message from a Linux machine, and if you look closer you'll see it's a date, a hostname, and some text. The text part is usually a complete English sentence with some variable parts in it, as you can see above. It's very easy to read — for a human. But once you have not just your workstation's log messages but log messages from hundreds of machines, or even just one single busy server, it's very difficult to find anything, or to create a report, or do any further processing with your log messages. People often feel lost when they first look at the amount of data they can collect.

There is a solution to this problem: structured logging, where events are represented not with free-form text messages but as name-value pairs. Coming back to my favorite SSH example: you can create name-value pairs for the source IP, the application name, the user name, and so on, and describe the same event with name-value pairs instead of free text — and this is much easier to search once stored in a database.

The good news is that syslog-ng has had name-value pairs inside from the beginning. It was necessary for flexible filtering: date, facility, priority, program name, and so on were all stored as name-value pairs inside syslog-ng and could be used for filtering.
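The free-text-versus-name-value-pairs contrast is easy to show in miniature. An illustrative Python sketch — the field names echo the SSH example but are not syslog-ng's exact macro names:

```python
import re

# A classic free-text SSH log line: easy for humans, hard to query at scale.
message = "Failed password for root from 10.0.0.5 port 22 ssh2"

# The same event as name-value pairs: trivially searchable once stored.
pattern = re.compile(
    r"Failed password for (?P<username>\S+) from (?P<source_ip>\S+) port (?P<port>\d+)"
)
event = pattern.match(message).groupdict()
# Fields describing the event, added based on content rather than parsed:
event.update(app="sshd", status="failure", action="login")

print(event["username"], event["source_ip"])
```

A database query like "all login failures from this source IP" now becomes a simple field match instead of a full-text search.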
It was just one step further to add parsers to syslog-ng, and this way any unstructured data — and some structured data formats — can also be turned into name-value pairs and used for filtering and message routing.

There is a JSON parser in syslog-ng, as this logging format has become quite popular recently. It can turn JSON messages into name-value pairs, so any data stored in these messages can be used in filtering, or you can store just part of the fields, or route based on field values, and so on.

The next one is the CSV parser. CSV stands for comma-separated values, but it was simply the first type of columnar data implemented in syslog-ng — that's how the name was born — and any kind of columnar data can be processed with the CSV parser. The most popular use is parsing Apache access log messages. As you can see in this configuration snippet, all the fields of an access log are described with column names, and at the bottom of the screen you can see that the user name parsed from the access log messages is used to name the file destinations where messages are stored.

The most interesting parser in syslog-ng is patterndb.
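Before moving on to patterndb: the Apache access-log setup just described might look like this in configuration — an illustrative sketch modeled on syslog-ng's documented csv-parser() example, not copied from the slide:

```
parser p_apache {
    csv-parser(
        columns("APACHE.CLIENT_IP", "APACHE.IDENT_NAME", "APACHE.USER_NAME",
                "APACHE.TIMESTAMP", "APACHE.REQUEST_URL", "APACHE.REQUEST_STATUS",
                "APACHE.CONTENT_LENGTH")
        flags(escape-double-char, strip-whitespace)
        delimiters(" ")
        quote-pairs('""[]')
    );
};

# The parsed user name then selects the output file, one per user:
destination d_peruser {
    file("/var/log/apache/${APACHE.USER_NAME}.log");
};
```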
It's a message parser which can extract useful information from unstructured messages into name-value pairs. And it can not only extract values — since it has to know the message format to be able to parse it, it can also add status fields based on the message content, and it can classify log messages, just like logcheck does on a typical Debian system. To get it working, you need XML files describing the log messages; some of these are on GitHub, ready to be used.

Coming back to the SSH login example: for a login failure, the user name and the source IP are actual fields extracted from the log message, while the status — failure — and the action — login — are fields based on the message content. And as it's a failure, it can also be classified as a violation.

In the upcoming version of syslog-ng there will be some additional parsing possibilities: one for parsing name-value pairs out of log messages, and another for parsing the audit log format into name-value pairs — that way you will be able to create alerts from your audit logs, for example.

Anonymizing messages has become a hot topic recently, as there are many regulations and compliance requirements which declare what can be logged and what must not appear in log messages. For example, under PCI DSS, credit card numbers are not allowed to be logged, and in Europe there are many different privacy regulations, so often IP addresses or user names are not allowed to be logged.

Locating sensitive information can be done in multiple ways. One is using regular expressions: credit card numbers or IP addresses, for example, can be located this way, which works on any kind of log message
— not just on known messages. On the other hand, it's quite slow — not an efficient way to find information. You can also use patterndb for locating sensitive data, which is very fast, but then you need descriptions for all of your log messages, or at least those where sensitive data can appear.

There are multiple ways to anonymize your log messages. The simple way is to overwrite sensitive information with a constant, which is simple and fast. But if you need to analyze your log messages and follow sessions in your logs, it's better to use hashing: the original data is overwritten by a hash, so you don't see the user name or IP address or any other sensitive data, but you can still follow sessions, since the hash will be the same for the same data throughout your logs.

Again, syslog-ng is implemented in C, which makes it possible to be a high-performance application, so it can process many more logs than logging solutions written in interpreted languages. On the other hand, not everything has to be implemented in C, and rapid prototyping is much easier in interpreted languages. So last year we started to implement language bindings in syslog-ng, so that destination drivers can be written in non-C languages. The core of syslog-ng now supports Python and Java, and Lua and Perl are in the syslog-ng incubator. The incubator is a sibling project of syslog-ng: if someone writes a module for syslog-ng, the first step is to include it in the incubator, where experimental modules are available, and once it has matured it can be moved to the syslog-ng core.

With the language bindings, the interpreter is embedded into syslog-ng. This has some speed advantages, and it also makes proper error handling possible. It's also possible to use external applications on the destination
side, but in that case there is no feedback toward syslog-ng if anything goes wrong; if the interpreter is embedded, proper error handling is possible.

As you might be aware, most big data applications are written in Java. C or Python clients usually exist, but not always, and even when they do, Java is the official client for big data solutions — the one maintained together with the server component. That's why we decided to develop the big data destinations of syslog-ng in Java. Using these destinations takes a bit more effort than usual, as our Java-based destinations cannot yet be included in the different Linux distributions — the same goes for most of the JARs we use, and the build tool, Gradle, is not yet in the distributions either. Hopefully in the coming months this will be fixed. If you'd like to try it, my blog goes into detail on how the different Linux distributions work from this point of view.

If you want to use syslog-ng, you need to configure it, and my first advice is: don't panic. The syslog-ng configuration looks quite scary at first sight, but if you take just a few minutes you'll find that it's not that scary — it's actually simple and logical. It has a pipeline model with many different building blocks — sources, destinations, filters, and so on — and once you define these blocks, you connect them with log statements into a pipeline.

Here I'll show an example configuration. First of all, some global options. syslog-ng.conf always starts with a version declaration — in this case, 3.7. Then, usually — not necessarily, but usually — there are some includes. scl.conf stands for the syslog-ng configuration library, as we have some configuration snippets prepared and bundled with syslog-ng which you can use in
your configuration. For example, for locating credit card numbers there is a long and ugly regular expression which you don't have to copy and paste into your configuration — just use SCL to find credit card numbers. Many similar features are implemented in SCL.

Next come some global options, which affect all the rest of syslog-ng. Many of these settings can be overridden in different parts of the configuration. For example, if you have a low-traffic server, flush_lines(0) means that each log message is written to disk as soon as it arrives; but if you also have an SMTP server with many incoming email messages, you might want to increase this value to a larger number to make sure logging performance is not affected — but only for the given destination, not for the rest.

The next step is defining sources — where you collect messages from. The first one is for local messages: system(). The system source is the solution for hiding away the differences between platforms: if you have Linux machines with System V and with systemd, you have FreeBSD, you have Solaris, and so on, you don't have to keep track of the system-specific log sources; you use the same configuration on all of the machines, and system() finds the right log source on each of them. internal() is for syslog-ng's own internal messages — in most cases it's fine to collect these together with the system logs, but on some occasions it's better to log them separately. Next you can see a network source: in this case it's UDP, listening on all IP addresses of the host, on port 514.

The next step is to configure some destinations. At the top of the screen you can see a file destination, in this case /var/log/messages. The other one is
more related to big data: an Elasticsearch destination, where you can set the index name, the cluster name, and the template for how you send your messages — in this case a JSON template, and the fields from the legacy syslog format are forwarded.

The next step is to configure some filters and parsers. The first one is a filter which discards debug messages and lets through the rest. The second one is typical for /var/log/messages, and you can see that many different filtering possibilities are used — filtering out debug messages, disallowing mail-related messages, and so on — all combined with boolean operators. At the bottom of the screen you can see how a parser is defined: just add patterndb and the XML file you use for describing your log messages.

And here comes the most important part of the configuration: the log path, where you connect all of the building blocks together. The first one is a typical line for /var/log/messages: it reads the system messages, applies the filter I showed on a previous screen, and stores them in a file. The next one is a bit more interesting — it's the log path for Elasticsearch — and you can see that we utilize both the local log source and the network log source, filter out the debug messages, use the patterndb parser on the remaining messages, and then store everything in Elasticsearch.

Here you can see a screen from Kibana. You can't really read it, but you can see that all of the data arrives safely in Elasticsearch. There are some graphs for the distribution of priority and facility, and in the upper right corner you should be able to see some results coming from patterndb — a top list of source IP addresses parsed out of SSH login messages
using patterndb. At the bottom you can see the distribution of log messages over time.

An upcoming and very interesting technology: Kafka. It's publish-subscribe messaging, and it's becoming more and more important in data-driven organizations — it's like a data backbone. syslog-ng can send messages to Kafka, and we are also working on implementing a Kafka source in syslog-ng, so that we will be able to collect messages from Kafka as well, not just send to it.

Finally, I'd like to summarize the benefits of using syslog-ng in a big data environment. First of all, it's high-performance and reliable log collection. It can also greatly simplify your data architecture, as a single application can be used for both system and application logs — just about any application data can be forwarded using syslog-ng. And it can significantly lower the load on the destination, the processing side, as syslog-ng can process log messages close to the source, in a distributed way: it can parse messages, forward only the important information, and format the messages to be ready for processing — all of it done on the syslog-ng side.

If you'd like to join our community, use syslog-ng, or find more information, the main entry point is syslog-ng.org. The source code of syslog-ng is on GitHub, and if you have any problems you can report them there; we also have a mailing list, and we are on the #syslog-ng channel on IRC, on Freenode. If there are any university students among you, we have open positions — and we are also creating the syslog-ng university, something brand new; more information is on the website. You can get small exercises to use in your programming assignments, and we can give you all the help to
Do you have any questions?

[Audience: isn't syslog-ng in practically all distributions?] Yes, but not always the latest version. It's in Fedora, SUSE, Debian, Ubuntu, Gentoo, Arch Linux, so practically all of the major Linux distributions. It's also available beyond Linux: it's in FreeBSD and OpenBSD, and there are packages for Solaris. It's only a question of time how up to date these packages are. If there is a syslog-ng release right before a distribution release, then they will carry the old syslog-ng for a while, but often there is an external repository which carries the latest syslog-ng for the given Linux distribution. It's the default in some, and mostly it's an optional package.

Yes, it works on both sides; client and server are the exact same binary, you only have to change the configuration. High availability is not built in, but it can work together with load balancers and with any high-availability solution.

Sure. If you have any further questions, not now but later, you can reach me by email, and I also have a blog where I regularly post information and updates about syslog-ng, interesting use cases, and so on.

I don't know if it's carrying you away, but here is a sample XML file, the one I used for parsing SSH messages, or at least part of it. At first it looks ugly, as usual, but here you can see the actual pattern used for parsing the message: it's the actual log message text, with named placeholders inserted where the variable parts are. The parts are named, and you can see the field names.
Those field names are used to define the message. And here is an example message, so you can verify immediately that the pattern you created is right. At the bottom you can see that, based on the message text, some additional fields are created: for example, that it's a login event and that it was accepted, so it was a successful login.

Actually, there is an application called ELSA, Enterprise Log Search and Archive, which does exactly this. It has many built-in parsers for different IDS systems like Snort, Bro, and so on, and parsers for iptables, Cisco, and other firewalls. It can store all of this into MySQL and index the messages, so you can easily search for any IP address in the database and see what happened on your network. It has some built-in tools for looking up information: it can call whois on the IP addresses, or do many similar kinds of magic, so it's often used in the security part of network operations centers. Sorry, ELSA: Enterprise Log Search and Archive. With raw data you don't really know, except that you can know if you use GeoIP; in that case you can create additional fields from the IP address, at either country level or city level, and add this information to your log messages and store it together with them. I think this is the same as what is used in ELSA as well. Archive, yeah, archive; sorry, I'm not a native English speaker.

Sure, that was the original purpose of syslog-ng. It's called "next generation," but actually it's 18 years old, so we could remove that. We can deduplicate, but only if the same message appears right after the other; otherwise it would require too big a memory buffer. The Debian packages, and also my openSUSE packages, are in the openSUSE Build Service, and if you drop me an email I can provide you with the links. I think they are also on the syslog-ng.org website: links to all of the different package sources.
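Going back to the pattern-db sample from the slide: a trimmed sketch of such a rule is below. The structure follows the patterndb v4 format, but the IDs, field names, and the exact pattern here are illustrative placeholders, not the ones from the actual slide.

```
<!-- Sketch of a patterndb rule for sshd "Accepted" messages.
     @ESTRING:name:delim@ captures a variable part into a named field. -->
<patterndb version="4" pub_date="2016-01-24">
  <ruleset name="ssh" id="ruleset-id-placeholder">
    <pattern>sshd</pattern>
    <rules>
      <rule provider="example" id="rule-id-placeholder" class="system">
        <patterns>
          <pattern>Accepted @ESTRING:ssh.auth_method: @for @ESTRING:ssh.username: @from @ESTRING:ssh.client_ip: @port @NUMBER:ssh.port@ ssh2</pattern>
        </patterns>
        <values>
          <!-- extra fields created when this pattern matches -->
          <value name="event.type">login</value>
          <value name="event.outcome">accepted</value>
        </values>
        <examples>
          <example>
            <test_message program="sshd">Accepted password for joe from 10.0.0.5 port 22922 ssh2</test_message>
          </example>
        </examples>
      </rule>
    </rules>
  </ruleset>
</patterndb>
```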
Any other questions? Yes, I already sent my slides to SCALE, so they should be on the website too. Thank you for your attention.

[Speaker change.] Two countries where there's one person: so there's one person in Canada, one in Finland... no, wait, two in Brazil now. Well, Avi, who's the CTO and original author, is at headquarters in Israel, but the core developers are all over the planet. Is there something else going on right now? Oh, the robot workshop, okay. So yes, it is: it's from the Odyssey, you got it. Which one of the creatures was it? It was the one with the tentacles that would grab people off the ships. Yeah. Well, we might as well get started; if anybody wants to move up, there's plenty of space at the front, and then we can hear everybody.

I'm Don Marti, I'm with ScyllaDB; we're coming up on our GA release. Before I get started: who is already running Apache Cassandra, or has run it in the past? Okay, so, not in production. All right, other NoSQL systems? MongoDB, okay. Hadoop? Riak? Any of the others? Redis, okay, big-time Redis. And Spark, okay; and what's the back end for your Spark storage? Okay. So yeah, Spark: Scylla turns out to be a really good fit, and some of the integration work that has been done for hooking up Spark and Cassandra just plugs right into Scylla. Ignite I don't know about; I know Spark for sure, because we recently put up an article about that, but I can check on Ignite.

All right, now that I have a little bit of a sense of where we're starting from: Scylla is a new NoSQL database, and its claim to fame is that it does about 10 times the throughput of Apache Cassandra while maintaining Cassandra compatibility, and with lower latency in general.
Scylla is open source; the founders, Avi Kivity and Dor Laor, are the two company founders, and it is a commercial open source project. They're well known for the KVM hypervisor, which is the basis of the Google cloud and many other cloud deployments, and for OSv, an early full-featured unikernel project, which is also open source and still out there. At Scylla we've got people in 10 countries around the world, and we are hiring: I don't know if you saw the jobs board, but we're looking for a C++ developer, QA, solutions engineers, a bunch of different roles within the project and the company.

So what makes Scylla so different from any other NoSQL database? Well, the design is completely different: it is completely sharded per core. Everything that has to touch memory, for example, only works on an individual core, and cores communicate by message passing, so there's a send and a receive queue for each pair of cores on the system, with no locks. Throughput depends on the system you put it on, but a sort of mid-range to upper-mid-range system will give you about 1 to 1.8 million CQL operations per node. CQL is just the Cassandra query language, so we go by how many CQL statements we can execute in a second. The 99th-percentile latency is consistently low; this is probably not surprising once we get more into how this thing's architecture is laid out. And the feature set is compatible with Apache Cassandra.

So here's a quick graph of throughput: the write throughput, the read throughput, and a read-write mix. The little green bar is Cassandra 2.1.9, and the purple bar is Scylla on the same hardware; this is running on bare metal with an SSD. And of course, I don't know if anyone went to Brendan Gregg's talk, but he points out that most benchmarks are bogus.
So it depends on your individual configuration, but you can generally expect to get an order-of-magnitude throughput improvement with Scylla. With that compatibility, you can very often take the same workload and the same number of clients and put them on fewer servers.

Now, as far as how much of Cassandra Scylla copied over directly and how much it completely re-implemented: the stuff that makes it easy to switch back and forth between original-recipe Cassandra and Scylla is all the same. Cassandra uses the SSTable file format on disk, and the Scylla file format is actually identical, so if you are migrating from a Cassandra server to a Scylla server, you can simply shut down Cassandra on a node, copy the files over, and start Scylla pointed at those same files; there's no conversion that needs to happen. Of course, Cassandra and Scylla may make slightly different decisions on where to split up those files: Cassandra will probably produce fewer, larger files, and Scylla may produce a larger number of files because of that per-core sharding, but either system will be able to read in the files that were written out by the other one.

If you go to the Scylla site, you'll notice there is no driver download section. That's because all of the drivers for all the common languages out there (C, Java, Python, Ruby, everything that has current Cassandra support) will work with Scylla: the wire protocol between the Cassandra client and the Cassandra server is implemented exactly by Scylla. The same goes for the CQL language; the query language is the same from Cassandra to Scylla. As for the config file, some of its entries refer to details of the Cassandra implementation that Scylla will simply skip, so they have no effect, but the same config file format will work on both systems: you can copy your cassandra.yaml over to scylla.yaml, start up Scylla, and it'll work, using the right IP addresses, ports, and snitch configuration for communicating among nodes.
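Because the on-disk format is identical, the "shut down, copy the files, start Scylla" migration is literally a file copy with no conversion step. The toy sketch below demonstrates the shape of that operation; the directory layout and SSTable component names are illustrative stand-ins, not real paths.

```python
import shutil
import tempfile
from pathlib import Path

def migrate_sstables(cassandra_dir, scylla_dir):
    """Copy SSTable component files byte-for-byte: no format conversion."""
    scylla_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(cassandra_dir.glob("*.db")):
        shutil.copy2(f, scylla_dir / f.name)
        copied.append(f.name)
    return copied

# Toy demonstration with stand-in SSTable component files.
root = Path(tempfile.mkdtemp())
src = root / "cassandra" / "data" / "ks" / "tbl"
src.mkdir(parents=True)
for name in ("ks-tbl-ka-1-Data.db", "ks-tbl-ka-1-Index.db"):
    (src / name).write_bytes(b"sstable bytes")  # stand-in contents

dst = root / "scylla" / "data" / "ks" / "tbl"
print(migrate_sstables(src, dst))  # ['ks-tbl-ka-1-Data.db', 'ks-tbl-ka-1-Index.db']
```

The files land on the Scylla side unchanged, which is the whole point: either system can read what the other wrote.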
On management: yes, it is managed with the same JMX endpoints. This is not because Scylla is a Java program; it's because there's a separate JMX component that starts up and translates between Scylla's own native REST API for management and the JMX endpoints that Cassandra provides. We're actually filling in a few more of these JMX endpoints as we go along, so any third-party tool that you can use to manage Cassandra, you'll also be able to use to manage Scylla. Client code is all the same, the on-disk database is the same, and, as I mentioned before, the integrations for other tools like Spark will work with Scylla as well.

This is just the detail on the versions of Cassandra that the current version of Scylla implements: the CQL language, version 3; the SSTable format, the same as Cassandra 2.1.8; JMX, also the same as 2.1.8; and the configuration file format is the same. Now, when Cassandra first started out, it had an older protocol called Thrift for storing and querying data. The Cassandra Thrift protocol is not implemented in Scylla, so you need to use a current Cassandra driver that uses CQL. At some point we will have Thrift support, but almost all new Cassandra applications have been CQL for quite a while now, and that's where Scylla is focused.

Yes, question in the back. The missing Thrift support would affect you if you had developed a Cassandra application using a very old driver that uses Thrift. If you've gotten your Cassandra driver within the past couple of years, when everything has been focused on CQL, then your CQL-based Cassandra application will just work.
It's just some of those Cassandra 0.9-vintage applications that are not going to run with Scylla yet.

Yes? Cassandra is a distributed NoSQL database, and at the very basic level it's like a distributed hash table, where data is given a persistent identifier and then stored on multiple nodes around the ring. That level of the system, where you choose how many copies of a piece of data are stored, is tunable. So you can actually tell Cassandra: I want to store copies of this data on three out of the five nodes of this ring, but when I do my write, I want it to come back as soon as one of them is written, and then Cassandra has to take care of replicating it itself. Or you can say: I don't want my insert statement to return until I know my data has been persisted to a quorum of the nodes where it's supposed to be. So one of the things about developing for Cassandra, as a developer or as a systems architect, is that you can decide how you want to balance speed versus safety. You could, for example, have certain items of data, like some cached data from a user session that you can regenerate, that you run very fast but with fewer copies; or you could say that this other set of data is something you can't easily regenerate, and therefore set the replication and the consistency level for that data so it needs to be on more nodes. And Scylla copies all of that exactly from Cassandra. So if you've gone through the process of reasoning about your data and deciding how you want to trade off performance versus safety, then all those same decisions will be reflected exactly in your Scylla configuration.

Yes: there's a repair functionality that will let you repair your ring when you replace a node.
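The replication and consistency trade-off described above comes down to a little quorum arithmetic. This is my own toy illustration of the rule, not Cassandra or Scylla code: with replication factor N, a write acknowledged by W replicas and a read from R replicas are guaranteed to overlap on at least one replica exactly when R + W > N.

```python
def quorum(replication_factor):
    """QUORUM = floor(RF / 2) + 1 replicas."""
    return replication_factor // 2 + 1

def read_sees_latest_write(rf, w, r):
    """True when every read set must intersect every write set."""
    return r + w > rf

rf = 3
print(quorum(rf))                                        # 2
print(read_sees_latest_write(rf, w=1, r=1))              # False: fast but risky
print(read_sees_latest_write(rf, w=quorum(rf), r=quorum(rf)))  # True
```

Writing and reading at QUORUM is the "I would cry if I lost this" setting; ONE/ONE is the "I can regenerate this cache" setting.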
There are different levels that you can apply when querying your data as well. You could say: I want my reads to come back only once I've read from a quorum of nodes; or you could say: I just need to hear back from one node about this particular piece of data. Because the Cassandra design is so flexible, it's possible to misuse it; you could build something that was both unsafe and slow if you really tried. But if you go through and figure out the possible failure scenarios and how you want each piece of data to be protected, then you can balance things out and get the right level of speed versus safety for all the data you want to preserve for your application. If there's something that you would cry over losing, then look at quorum: your write will not come back until a quorum of nodes has a safe copy of it.

As for Cassandra training: it's the same. The Apache Cassandra project has a lot of interesting discussion of different data models, and there's also a regular Cassandra meetup. If you go to the Cassandra meetups, very often there will be a talk from somebody who has gone through exactly the kind of process we're talking about: here's the initial data model we came up with, and then, for whatever performance, safety, or ease-of-administration reasons, we went through a couple of different versions of it.

Oh, that: we have done some reporting back to the Cassandra project, and there are Scylla people on the Cassandra list, and I think vice versa. It is kind of symbiotic, and it's all open source, so if you like the Cassandra model and you like the drivers, it carries over. Yeah, we are going to offer training, but not yet; right now we're focused on GA.
So if you're looking to develop applications for Cassandra, then yeah, there are some good tutorials out there. There's an O'Reilly book on Cassandra too, but I think the O'Reilly one is kind of old; either Apress or Packt had a good one out more recently. The good news is that there's a lot of good information about it, and if you go to Stack Overflow there are some good discussions of different Cassandra options as well. And if you can't get something: from an open source point of view you can always get on our user mailing list and ask, and there's also a support line you'll be able to get on, with a support plan too if you're interested in that. For now, the user mailing list has all the main developers on it, so you'll get good answers to any Scylla user questions as well. Okay, thanks.

Oh, gossip is the way that both Cassandra and Scylla share information about which nodes are up and which nodes are down. Scylla copies the same general principle for communication between nodes of the cluster; however, you can't take a Scylla node and add it to a Cassandra cluster, or vice versa. The wire protocol of what travels between nodes is a little different, so you would need to replace all the nodes in a cluster at once in order to do the migration. Each individual node is easy, but you can't do one node at a time. Now, both have the concept of data centers, where you might have a duplicate of the same data in different regions for resilience, and in the near future there's going to be an option of having a Cassandra data center and a Scylla data center that cooperate. But that's a near-future thing, not something you can do right now. One of the things that some sites do when they experiment with migrating across NoSQL systems is dual writes: you write to both systems for a while, and then cut over at some point. That's fairly common in large-scale NoSQL situations where you can't afford to have something go down.
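The dual-write migration pattern just mentioned can be sketched in miniature. Here plain dicts stand in for the old and new database sessions; the class and method names are illustrative, not any real driver API.

```python
class DualWriter:
    """Toy dual-write migration: write to both stores, read from the
    old one until cut-over, then read from the new one."""

    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store
        self.cut_over = False

    def write(self, key, value):
        # Every write goes to both systems during the migration window.
        self.old[key] = value
        self.new[key] = value

    def read(self, key):
        # Reads come from the old system until cut-over.
        return (self.new if self.cut_over else self.old).get(key)

cassandra, scylla = {}, {}          # stand-ins for the two clusters
app = DualWriter(cassandra, scylla)
app.write("user:1", "alice")
print(app.read("user:1"))           # served by the old store
app.cut_over = True
print(app.read("user:1"))           # same answer, now from the new store
```

In a real migration you would also backfill historical data into the new store before flipping `cut_over`.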
But we are going to do the translator functionality soon.

Okay, so how does it work? How is it possible to do all that stuff, but so fast? Well, this is Ada Lovelace, and before she even had a computer to program on, she came up with this, which is surprisingly accurate: "How multifarious and how mutually complicated are the considerations which the working of such an engine involves. There are frequently several distinct sets of effects going on simultaneously, all in a manner independent of each other, and yet to a greater or less degree exercising a mutual influence. To adjust each to every other, and indeed even to perceive and trace them out with perfect correctness and success, entails difficulties whose nature partakes, to a certain extent, of those involved in every question where conditions are very numerous and inter-complicated." How many of you have dealt with that? Yeah.

So most of today's server software was designed back in the day when you had one core per processor. SMP systems were out there, but it wasn't so big a thing; even your typical mid-range system was single-core. Today you can go down to, well, maybe not Target, but certainly Fry's, and buy a 10-core processor; you can click on a cute little web form and get a machine with 20 or 28 cores by the hour. The difference in how these systems have to be coded for is remarkable. Instead of the processor spending most of its time doing actual work, when you take an application that has thread pools or a lot of contending processes at the OS level, the cores are spending most of their time contending for locks and getting in each other's way.
So what Scylla does is let each core run flat out, by giving each core its own dedicated set of resources. In a classic multi-threaded application, you've got a lot of threads going on, and the threads need to make system calls into the kernel; of course, when something wants memory, the kernel has to handle allocating memory to many different threads running in the same process. Scylla is based on a thread-per-core architecture called Seastar, and in Seastar every core has its own chunk of memory to allocate from, so when one core needs to allocate memory it doesn't have to take a lock; it can just get memory that's dedicated to that core. For writing to the disk, it isn't a multi-threaded application: all of the writes in Scylla are done with Linux asynchronous I/O. What this means is that Scylla becomes sensitive to the performance of the file system; the performance of the database is no longer the issue, and now it's a question of how well your file system handles asynchronous I/O. The kind of performance results we're seeing with Scylla will show up with XFS, which happens to have excellent asynchronous support, but you're not going to see the dramatically high Scylla performance with file systems that don't have as solid support for asynchronous I/O.

Now, I said something a little earlier about the speed of development that Scylla was able to achieve: taking a fairly complex NoSQL development project and making it happen in a year. That doesn't sound like the kind of thing that's possible with a thread-per-core architecture where every core has to communicate with every other core by message passing; that sounds like a really hard system. The big win from the Seastar framework that Scylla is based on is that it gives the C++ programmer a set of abstractions that make it possible to develop with that model in a reasonable amount of time, at a high quality level.
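The shard-per-core idea just described can be sketched in miniature. This is an illustrative Python toy, not Scylla's actual routing (Scylla uses its own token-based scheme in C++): each partition key hashes to exactly one core, and only that core ever touches the partition's memory, so no locks are needed.

```python
import hashlib

N_CORES = 8

def owning_core(partition_key, n_cores=N_CORES):
    """Stable hash of the partition key -> owning core id."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_cores

# Each core gets its own private memtable: no shared state to lock.
per_core_memtables = [dict() for _ in range(N_CORES)]

def insert(key, value):
    # Only the owning core's structure is ever touched for this key.
    per_core_memtables[owning_core(key)][key] = value

for i in range(1000):
    insert(f"user:{i}", "row")

# Every key lives in exactly one core's memtable.
print(sum(len(m) for m in per_core_memtables))  # 1000
```

Cross-core operations then become explicit messages between cores instead of shared-memory accesses, which is what the send/receive queue per core pair mentioned earlier is for.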
The constructs of futures, promises, and continuations make it so that the programmer developing Scylla can work with functions that return values you can reason about, rather than handling message passing across cores by hand.

Now, a lot of database tuning is based on tuning how Linux handles the page cache for disk I/O. Scylla has taken the approach of building a cache into the design instead of relying on the OS to cache blocks, and then everything that goes to disk is asynchronous I/O. What this means is that instead of a conventional layout, where the database might modify one row within a 4K block and the OS then has to spew that whole 4K block out to disk, Scylla modifies data in memory, and when that cache gets written out to disk, Scylla has application-level knowledge of the cache and can write out just the data that needs to be written. So there's no such thing as a parasitic row, which is a row that doesn't need to be written out but tags along with one that does.

Oh, okay; yes, the SSTables are written in a different way: they're written using asynchronous I/O, using a more efficient mechanism, but the bits that end up on disk are the same as what gets there the old-school Cassandra way.

Well, if you look back in the day, how did Linux get asynchronous I/O in the first place? That was Oracle developers saying: we need this feature. Linux didn't have asynchronous I/O, to the best of my knowledge, until the Oracle people were working on porting their database and ran into a very similar issue. So a lot of the high-performance techniques in Scylla are things that have already existed in operating systems and in relational databases.
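The futures-promises-continuations style described a moment ago is a Seastar (C++) idea, but as a loose illustration only, the same shape can be sketched with Python's asyncio: each step returns a future, and continuations are chained instead of blocking threads or passing raw messages by hand. The function names here are invented for the sketch.

```python
import asyncio

async def read_block(block_id):
    # Stand-in for an asynchronous disk read: yields instead of blocking.
    await asyncio.sleep(0)
    return b"x" * block_id

async def handle_request(block_id):
    data = await read_block(block_id)   # the continuation resumes here
    return len(data)                    # ...when the future resolves

async def main():
    # Many requests in flight at once on one thread, no locks.
    results = await asyncio.gather(*(handle_request(i) for i in range(5)))
    print(results)  # [0, 1, 2, 3, 4]

asyncio.run(main())
```

The point of the abstraction is the same in both languages: the programmer reasons about returned values, while the framework schedules the work.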
These high-performance techniques haven't yet been applied to NoSQL, though. So there's some computer-science wisdom of the ages being dug up and reapplied to this kind of task, alongside the more innovative sharding and the futures-promises-continuations model; there's also some "let's go back and look at what Oracle had figured out in the late '90s and early 2000s," things that other relational databases have also taken advantage of.

Oh yeah, sure. Well, repairs can be faster because your round trips are significantly faster; you never have a Java garbage collection, for example. Instead of some transactions facing high latency, if everything comes back quickly, then a complex operation like a repair is going to complete more quickly as well, and in a more predictable amount of time.

Okay, so this is the simple version: if you need a pony, you get a pony. So, I didn't know about the 1024x768 requirement; I can show you the Scylla CQL requests served per second. This is a system that's doing about two million requests per second, and you'll notice it's consistently high, and this is across a three-node cluster, by the way: two million across a three-node cluster. There are details on this benchmark on the Scylla blog. Cassandra takes a little while to come up to speed, and the overall throughput is lower. That's just the basics.

Now, the interesting metric here is latency, and this graph is a little complicated. The throughput is here: Scylla is doing over 250,000 CQL requests per second, and Cassandra is doing about 150,000, so Scylla is running at a bit higher throughput than Cassandra in this scenario. The y-axis over on the other side is latency in milliseconds. On mean latency, Scylla is a little quicker than Cassandra; on median, still substantially quicker.
At the 95th percentile you can see a little bit more Cassandra latency, but once you start to get into the 99th and 99.9th percentiles, you start to see these over-five-millisecond latencies on the Cassandra side, while Scylla is still consistently pretty low. This is the latency of the slowest 0.1 percent of requests: the slowest of the slow, the ones that happen to hit a slow path in the system. It doesn't mean that one box is slow; it means that if a box is serving a million requests, it'll serve a thousand slow requests.

Yeah, there are a bunch of these benchmarks that the Scylla developers are running all the time, so the ones we make the graphs for tend to be the more representative ones. And we do them on different infrastructures: we've done some on different bare-metal configurations and on Amazon.

So what's Scylla written in? Scylla is written in C++14, so it's the first database that we know of to be written in the latest version of the C++ language. And the horizontal lines go with the numbers on the right side, and the bars go with the numbers on the left side.

Well, it's not necessarily because of low latency that the throughput is high; it's possible to have a system with very high throughput and very high latency, and as the throughput goes up, the latency also tends to go up in a production system, so you have to balance them. In this case it's a test where Cassandra actually got a little advantage, because it's running at a lower throughput and being compared straight up on latency.

And this is just an example of a quick repair, using the nodetool command for Cassandra: it starts a repair at 17:33:44 and it's done at 17:34:10. So we'll have some more detailed benchmarks on quick repairs too.
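The tail-latency arithmetic above can be made concrete with a toy distribution (my own illustration, not the benchmark data): p99.9 is the latency of the slowest 0.1 percent of requests, so even when the median looks perfect, the slow path still shows up there, and a node serving a million requests serves about a thousand of them on that slow path.

```python
def p999(latencies):
    """99.9th percentile: the best latency among the slowest 0.1% of samples."""
    tail = max(1, len(latencies) // 1000)      # 0.1% of the samples
    return sorted(latencies)[-tail]

# Toy distribution: 99.9% of requests at 1 ms, 0.1% hitting a 6 ms slow path.
latencies_ms = [1.0] * 99_900 + [6.0] * 100

print(sorted(latencies_ms)[len(latencies_ms) // 2])  # 1.0: the median looks great
print(p999(latencies_ms))                            # 6.0: the slow path shows up
print(1_000_000 // 1000)                             # 1000 slow requests per million
```

This is why the graphs in the talk plot p95/p99/p99.9 separately instead of just the mean.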
So once you have the speed, what can you do with it? It's not just a matter of paying a smaller bill to your cloud provider, although that's nice too. You can shrink the overall number of nodes in the cluster. You can make it faster to scale your cluster up and down for big events, such as Black Friday for retail; or, for example, if you're doing an online game and you have a big release of an expansion pack and a lot of people are playing, you can ramp up the number of nodes for your game. You can do more with modeling your data: instead of simplifying your data model on a performance-driven basis (Cassandra lets you do a lot, and Scylla will let you do the same stuff), you can do more interesting designs of your data model. And operations like repairs can happen without bogging down the cluster. There's just extra performance headroom to play with, which means that administration becomes a lot less painful.

What's that? Because you have the extra performance, you can use a more sophisticated data model. Sometimes in designing a Cassandra data model you end up having to denormalize things and make extra tables just to handle specific queries. With Scylla you still have to do some of that; it's not going to do your computer science for you. But you have more options in how you design your data, and different data models can be acceptably fast. No, not necessarily; but because you have more performance, you can either use that to have fewer boxes, or you can keep the headroom for things like running your repairs online or doing more sophisticated queries on the same data. And that just means that you can chill. I saw this building and I had to take a picture of it, because this is the data center where people who have simple NoSQL databases can go.
It's not the stuff that has to be tweaked and babysat constantly.

All right, so what's coming up next with Scylla? Well, we're looking to build a community. People who want to do stuff that plugs into Scylla are especially welcome; you do not have to be a C++ expert to get into it. We have some DevOps stuff that we're working on, and we're looking for anybody who wants to make something interesting and have a good NoSQL data store for it. Of course, the core database continues to improve. We're already integrated with Apache Spark, but there's a huge stack of stuff that will connect to Cassandra, and we're hooking all of that up to Scylla as well, including management tools. And then there's the possibility of having Scylla implement more protocols, not just the Cassandra protocol; if you have ideas there, please get on our mailing list and we can discuss that.

Stuff we've come up with in the past couple of releases (we're doing a beta release every few weeks now): the past few releases have had a new official AMI, this one using CentOS 7; we recently introduced SSL support, and the ALTER TABLE command, so you can change your schema on the fly; and sleep mode. In the very first few releases of Scylla we had a polling mode, where every core runs flat out waiting for more inbound traffic. This is great for speed, not so great for your battery when you're developing on a laptop. Sleep mode will now let Scylla go to sleep and wait for you to run your tests on your laptop. I'm especially happy about this one, because now I can just turn my laptop around and do my Scylla development and testing on it.

We also have a project that is not actually even part of Scylla, called cassandra-test-and-deploy, which is based on Ansible. It's a way to easily spin up a bunch of Cassandra nodes and test different failure scenarios on them, or, of course, Scylla nodes.
So anytime you want to test Scylla against something else, or just try something out on Cassandra, check out the cassandra-test-and-deploy project. It's on GitHub: go to github.com/scylladb and all our projects are there, and you can find our GitHub page from the main website. There's a very simple contributor agreement that we ask people to accept in order to get code into the main ScyllaDB project; it's just to say that you wrote what you submitted and that we can take it and do our stuff with it, and we do commit to it being open source in that agreement. The way the development process works is that the code is hosted on GitHub and we use GitHub issues; however, we do not use GitHub pull requests. If you have a change that you want to discuss and have the team consider for acceptance into Scylla, that would be something you'd post to the mailing list, and we have a document on how to make git do that automatically. The Linux kernel developers have come up with a pretty slick system for taking a git branch, turning it into a series of patches, and posting it to a mailing list, and we use all that exact same stuff; it's essentially Linux-kernel-style development.

And as I mentioned before, you don't have to be a C++ programmer. If you're into Python, let me know; we'll do some DevOps. I'm building out from demo scripts to more deployment and easy ways to manage my Scylla projects, or projects I have that are on top of Scylla, where I need to deploy Scylla along with everything else. And there's a knowledge base, so if you want to write something up for that, let me know, and we'll see what kind of projects you come up with. All right, that's about it; any questions?

Yes. Well, Spark is more of a distributed processing framework, and Spark can connect to many different sources of data to be piped into it and processed with great parallelism across many nodes.
processed with great parallelism across many nodes. So Spark is commonly used with Hadoop, commonly used with Cassandra, and now it can be used with Scylla as well. So you'd have your data in Scylla. Let's say you had your users playing games and rating each individual game; all that data on who clicked four stars or five stars or one star would get spewed into Scylla, and then you want to do your reports on which games are good, or which versions made users like the game more or less. You could do those reports through Spark, just as an example.

Yeah. Well, Cassandra has the CQL query language, but it's a really basic query language, and you would take the results of that and feed them through something else to do more sophisticated processing.

Yeah, you can do the same thing. If you look at a lot of the sites that are doing big volumes of data in Hadoop, they are getting a lot of data written into Cassandra, because Cassandra is fast for writes, and then they will take it out of Cassandra for processing in Hadoop, or they'll feed it out of Cassandra through Spark. Scylla fits into the Cassandra role in a situation like that. Of course, it's faster, so you're likely to get your results quicker if you hook it up to Spark. So it fits in the same way. If you're interested in replacing your Hadoop clusters (and it completely depends on how your application is designed), you might do Scylla and then Spark to process the data. The Cassandra functionality is just a database; it doesn't do all the distributed processing that Hadoop can, but Spark does distributed processing.

Okay, yeah. Well, Cassandra would be the data store that either feeds into Spark, or that you take your Spark results and feed them out into, as some kind of data store on the other end.
So Spark does distributed processing, and Cassandra and Scylla do the distributed database. Well, Hadoop has a bunch of separate projects within it, right? So Hadoop has HBase, which is a database, and Hadoop also has distributed processing aspects to it, so Hadoop is a big stack of stuff. Some people have found that instead of having the whole Hadoop stack, they can get faster and simpler results by having a Cassandra database and then Spark for distributed processing, and Scylla drops in as a replacement for Cassandra in that scenario. Since Spark is known for being fast and Scylla is known for being fast, it sounds like a good combination. Yeah, you'd still need something for distributed processing, and Spark is the hot thing now for distributed processing. Yeah. And Scylla should be able to get through 8 to 11 terabytes of data pretty quickly, depending, of course, on your hardware and your data.
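For reference, the on-the-fly schema change mentioned in the talk is just the standard CQL ALTER TABLE statement. A minimal sketch, with a made-up keyspace, table, and column name:

```sql
-- Hypothetical example: add a column to a live ratings table without downtime.
ALTER TABLE games.ratings ADD clicked_at timestamp;
```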
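The kernel-style "branch to patch series" workflow described in the talk can be sketched roughly as follows. This is a hedged illustration, not the project's official procedure: the branch name, commit messages, and mailing-list address below are all made up, and the real steps live in the Scylla contribution docs.

```shell
# Sketch of turning a git topic branch into an emailable patch series.
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q -b main demo
cd demo
git config user.email "dev@example.com"
git config user.name "Demo Dev"
echo "hello" > README
git add README
git commit -q -m "Initial commit"

# Do your work on a topic branch:
git checkout -q -b readme-fix
echo "world" >> README
git commit -q -a -m "docs: clarify README wording"

# Turn the branch into a patch series, one mbox-formatted file per commit:
git format-patch main..readme-fix -o patches

# Posting the series to the list would then be (not run here; address is made up):
#   git send-email --to=scylladb-dev@example.com patches/*.patch
```

`git format-patch` writes each commit as an email-ready patch file, which is exactly what `git send-email` then posts to the mailing list for review.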
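The game-ratings report used as an example in the Q&A boils down to a group-and-average over (game, version) keys. In production you'd run that through Spark reading from Scylla; here is a minimal pure-Python sketch of just the report logic, with made-up data and function names:

```python
from collections import defaultdict

# Made-up rating rows, as they might be read out of a Scylla/Cassandra table:
# (game, version, stars)
rows = [
    ("asteroids", "1.0", 4),
    ("asteroids", "1.0", 5),
    ("asteroids", "1.1", 2),
    ("pong", "1.0", 5),
]

def avg_stars(rows):
    """Average the star ratings per (game, version) pair."""
    totals = defaultdict(lambda: [0, 0])  # (game, version) -> [star sum, count]
    for game, version, stars in rows:
        entry = totals[(game, version)]
        entry[0] += stars
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

print(avg_stars(rows))
# → {('asteroids', '1.0'): 4.5, ('asteroids', '1.1'): 2.0, ('pong', '1.0'): 5.0}
```

With Spark and a Cassandra-protocol connector, the same report would be a table scan followed by a groupBy and average, distributed across the cluster instead of running on one machine.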