Hello? Hello? Okay, yeah, there it goes. Okay. Hello? My name is Spencer Krum. How is the volume? Seems like it might be a little too loud. I don't know if anyone agrees with that. Okay, my name is Spencer Krum, I work for IBM, and this is a talk called Secure Peer Networking with tinc. The slides are on my GitHub, and they'll be part of the SCALE slide spammage that will come out, I'm sure. Can the people on the left side of the room do me a favor and just kind of shift to the right? Because when someone opens that door, they're going to try to sit down right there, so it just makes it easier for everybody.

At IBM I work primarily on OpenStack. I also work with the Puppet community; we renamed our GitHub organization to Vox Pupuli, and that's where a lot of my open source contributions come from. Within OpenStack, I work on the infrastructure project, which means we run the CI system for all of OpenStack: sometimes 20,000 jobs a day go through our CI system, plus code review and so on. I'm from Portland, so when I travel I like to show people what Portland, Oregon looks like. I know a lot of people think rain and other problems, but we're actually a very beautiful city with a pretty epic mountain in the background and more green trees than Californians can even conceive of.

This is a talk primarily about tinc. The abstract says Consul is in it, which is true, Consul is in it, but if you're here primarily for Consul enlightenment, this probably will not bring you to glory, unfortunately. So the two pieces of software we're talking about here are tinc and Consul, and these two pieces of software follow an inverse proportion between usefulness and website awesomeness. If you look at the Consul website, there are CSS3 transitions, there are JavaScript starbursts flying around, and the software is actually kind of hard to use.
If you look at the rsync webpage, it's got H1s and H2s and divs, and rsync works great. Similarly, the tinc website was made in probably 2004. It's got a PayPal donate button, and that's nearly the only image. It's that kind of low-key. The logo itself is also there, but I'm not even really sure what the copyright status of that image is, so I'm not going to comment on that.

So this talk is mostly about tinc; it's not that much about Consul. tinc ultimately is a virtual private network system. This is the obligatory noisy picture of a VPN: we have some classics like the firewall, or, yeah, the wall that is on fire; we have the mainframe servers; the strange blue film-reel-highway thing. But tinc isn't really like your OpenVPN, and it isn't like your IPsec VPN; it isn't a point-to-point kind of thing. With a traditional VPN, you have a fleet of laptops or something that all connect to one VPN server, or maybe two in a high-availability pair, and then get access to a network behind that. tinc is a mesh. It's a distributed mesh with routing. A node will come up and connect to one or any number of nodes, and then have access to the entire network. We'll look at some graphs of what that looks like in a minute, but in our setup, every node talks to every other node, mostly.

So, some fast facts about tinc, just some quick information. It's from the era of this cell phone: they started working on it in 1998, in C, and it's still going strong. It's got two developers who have written most of the code, with a lot of drive-by patches from various humans. I am not one of those developers; I'm just an enthusiast. The whole talk, really, is about this thing we dusted off. It's been living in Debian repos for a long time; we started getting some use out of it, and I figured I'd talk to SCALE about it. It behaves just like a Unix daemon. You don't need a kernel module or anything.
You need the TUN/TAP driver, but you don't need anything special beyond that. Their development is done kind of weird: it's the classic old-school pre-GitHub development workflow that's been modified by GitHub. If anybody works with Apache, you can still mail a patch to most Apache projects and they'll apply it, but you can also open a pull request and they'll merge it, and there isn't really good standardization around that yet. But there is a dev mailing list that's very good, there's a Freenode channel that's really good, and Guus, who's the guy who writes most of the software, is very responsive on both.

We looked at using tinc internally at my last job, and what we found is that for bulk file transfer, we lost about 15% of performance moving our stuff over tinc. So that's there. It's also worth noting that the crypto has not been independently audited, so I wouldn't bet the farm on it without spending a fair amount of time looking at it. But it's old, right? It's old enough that it's got port 655 in /etc/services, so that's pretty cool.

All right, who remembers this, huh? "The network is the computer." I'm not even really sure what this is; it's like a mouse pad or something with a handle, and this guy is really excited about his business flying into the computer age, but this is kind of the idea. When I was in college, we had an old network that had been built in this era and never changed, and it was a very special network, not really modern, but it had some really awesome things about it. All the machines, all the printers, everything had a public v4 address, and there wasn't really a firewall.
Everything was just kind of connected, and this was kind of weird, but it also meant that the distinction between a server and a client was a little more muddled, a little more vague. You could just write a little daemon and start running it on some Linux lab machine, and other people could start using it, and the next thing you know it's a service and people depend on it. Any node could hit any other node; it was fundamentally flat. There weren't a lot of firewall rules or NAT rules preventing those kinds of connections.

And we've really gotten away from that. Everything now is HTTPS; it's the only port, we don't even need the others. That's kind of lame, because if you want to share files or have a conversation with someone, you've got to do it through some abstraction layer built on top of HTTP, which is really annoying. If you want to run your own services, there's something like ownCloud, but if you don't want to do that and you just want a file transfer protocol, it's basically out the window. You have things like Dropbox and Google Drive, but then you're giving all your data to them, and that's not the greatest thing. So my friends and I were like, what can we do?
We found this tinc thing, we started dusting it off, and we said: we're going to build ourselves a mesh network and see where it goes. What that meant is that five or six of us installed the tinc software, started the daemon, and configured it so that we would start connecting to each other. This is probably different from 99% of tinc deployments, because 99% of tinc deployments are somebody who wants to build a VPN for their servers, or their network, or their laptops, things they control, so it's one administrative domain. We have a different problem, because we have six or seven administrative domains, and this is all free-time, hobby-project type stuff, so sometimes a node is just off and nobody cares; we'll deal with it later, right? So there's a blurry line between the workstation and the server: we run it on laptops, we run it on home routers, we run it on Rackspace gear that's just hanging out in the rack, and a couple of random places.

So let's look at what the network looks like. This is an actual valid picture of tinc. tinc is kind of a cool daemon: it will dump the network state as a dot file every minute to a special place in /var, which is kind of cool, and then you render it with Graphviz or something like that. I've anonymized all the names to just numbers, but we can still talk about the graph. The easiest thing to identify is that some of the nodes are on the bus, and some of the nodes are not on the bus; they're just not connected. What that means is that those nodes, which were tinc daemons, were connected at some point, and my tinc daemon has not forgotten about them; they're just not connected right now. If they come back, we know who they are; they're not going to come back with a new name or a new number.

You also see that there are maybe two classes of nodes inside the graph. There are nodes like 1 and 10 that only have one connection in, and then there are other nodes that have multiple connections. The way tinc works is that you only need to connect to one node that is part of the mesh, and then you have full access to everything else. If you have multiple connections, tinc will automatically probe them, sending packets back and forth and testing transfer times, so that it can find the most optimal path to send your network traffic down on its way. And it's self-recovering: if we lost node 2, traffic could go 5, 9, 6 instead of 5, 2, 6, or whatever. It routes around failures and continuously probes for efficient routes.

Every node is essentially a daemon, and that daemon is responsible for two things: a subnet and an IP address, and really the IP address is kind of optional. With a traditional OpenVPN, when you connect up, you get issued a 10.something IP address, and that's your source IP address as you hit every server inside the network. With tinc, when you connect up, each node is assigned a subnet, and as an operator, you choose the size of that subnet. What we did with my friends is we took a /16 out of 10 space, so out of 10.0.0.0/8 we sliced 10.11.0.0/16. That's a lot of addresses, 65,536 of them. We decided that the entire tinc network, which is flat, would be represented in that /16, and then we assigned each node a /24.
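The arithmetic behind that carving is straightforward; here's a quick sanity check in shell (just bit math, nothing tinc-specific):

```shell
# Host counts are 2^(32 - prefix length); shifts keep this POSIX-sh safe.
echo $((1 << (32 - 16)))                          # addresses in 10.11.0.0/16: 65536
echo $((1 << (32 - 24)))                          # addresses in each /24: 256
echo $(( (1 << (32 - 16)) / (1 << (32 - 24)) ))   # /24s that fit in the /16: 256
```

So the /16 holds 256 per-person /24s, each with 256 addresses.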
Then we found out that some humans run more than one computer. So originally we were assigning a human, a node, and a subnet; then each human got a /24, and they started slicing their /24s up and creating nodes inside them. So each human gets a /24, and each node gets something, usually a /24 unless it's a child node. These nodes, 1 and 10, I honestly don't remember anymore, but they're probably laptops that are connected up to a superior node, and they just have something like a /31, basically one address. People with me, kind of? Okay.

So, making diagrams with open source tools is hard. I did my best. This was a lot of work; I'm sorry it's not better, or labeled, or even circles, but, you know, where are the OpenOffice devs? Take my money. Okay, so tinc has a concept of ConnectTo; this is literally a word in its config file. When you start a tinc daemon on Unix, it reads its config file, looks at the list of things to connect to, and sends sort of the tinc version of a SYN: hey, I would like to connect to you. They negotiate, and if it's authorized, they make a connection. A node like this bottom node is the classic laptop node: it just connects to the thing it knows to connect to. Nodes like this one are pretty well connected, with both sides connecting to them; I guess you'd call that a reflexive connection, not recursive, but, you know, a bi-directional connection. And again, this is looking at the connection path; once the connection is made and established, you have a graph that looks like this, where traffic flows backwards and forwards across everything.

Basically, the difference between a node like this one and a node like this one is that the node at the top probably has a well-known public IP address, which allows something like a laptop, something that doesn't have a well-known public IP address, to connect to it. So there are a couple of different classes of nodes, like I discussed. If you have a piece of gear in a data center somewhere, which some of us do, just for reasons, because, you know, nerds, that's a great node for tinc. But a laptop that's running around to conferences and all this stuff can have any IP address in the world, so it needs to make an outgoing connection. Then there are things in the middle, where someone has, I think around here the ISP is Cox, is that true? The ISP is Comcast? What's the LA ISP? Charter? Joy. Okay, so somebody has Charter, and, you know, you have your home router and you port-forward some well-known port, and you use DynDNS or something like that. For things with well-known IP addresses talking to well-known IP addresses, it's usually symmetrical.

There are also places where people joined the network at different times, just because it's a friends' social group, right? It's not actually administratively controlled by someone doing it right with Puppet or something. So probably what happened here is that this person was one of the earlier nodes, so his or her key is published widely, and when this person joined the network, they never bothered to update configs to connect back to this person. That's basically what's happening.

So how is tinc authenticated and secured?
The security layer is kind of beyond me, but the way it's authenticated is that every tinc daemon that starts up generates an RSA key pair, a public and a private key. Obviously you keep the private key on the tinc daemon node itself, and you push the public key out to everybody else. There's a directory called hosts in the config directory, which we'll show later, and each host file just has the IP address of the thing, the subnet it answers for, and its key. And once you're connected, you're connected.

Early on, when we were figuring out who our friend group was and who we were going to let do this, we got to the idea: I want to let Alice connect to me, but if Alice adds Bob, I don't want to answer for Bob's traffic; I want some way to shut that down. So we looked into it a little harder. At the IP layer, if you send a packet from one IP address on the tinc subnet to another IP address on the tinc subnet, traceroute shows you nothing in between; it just goes in and comes right back out with zero hops. At the tinc layer, there's really no tooling to say, I'm not going to allow traffic from this network or that network; there's no concept of a time-to-live across it. And then somebody pointed out, pretty smartly, that if we already allow traffic back and forth with Alice, Alice can just NAT for any traffic she wants, so there's nothing I can do to prevent Bob's traffic from getting to me at that layer anyway. So we decided we'd calm down, stop bikeshedding it, and just trust everybody.

It's interesting: this isn't an overlay network like Tor or something, where it's public and anybody can join. Someone would have to say, Spencer would like to join your network, and then give me a key, and I'd give them a key, and then we could start communicating once we'd configured daemons and restarted them.
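To make the keys-and-hosts-directory story concrete, here's a sketch of the on-disk layout for a tinc 1.0 network. The names (meshnet, alice, bob) and the bob.example.org address are made up, and in real life bob's host file would also contain his RSA public key block, generated with `tincd -n meshnet -K`:

```shell
# Sketch of /etc/tinc/<netname>/ -- built under /tmp here so it's harmless.
net=/tmp/tinc-demo/meshnet
mkdir -p "$net/hosts"

# tinc.conf: who am I, and who do I dial out to on startup.
cat > "$net/tinc.conf" <<'EOF'
Name = alice
ConnectTo = bob
EOF

# hosts/bob: bob's reachable address and the subnet he answers for.
# (His public key block would be pasted below these lines.)
cat > "$net/hosts/bob" <<'EOF'
Address = bob.example.org
Subnet = 10.11.2.0/24
EOF

ls "$net/hosts"                   # one file per peer we know about
grep ConnectTo "$net/tinc.conf"
```

Exchanging those hosts/ files out of band is the whole "give me a key and I'd give them a key" dance.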
So, on to daemon configuration and control. The tinc daemon is pretty cool. It's, I guess the word is old school, it's kind of before my time, but you control it with signals. In this example, we're sending the USR2 signal to the tinc daemon and then looking immediately at syslog. When tinc gets that signal, it dumps connection state information as text into syslog, so you get some useful stuff like the IP address of the connection. You can see this is a bi-directional connection, and you can see there's a weight. This is the example network I set up so I could show you things without violating all my friends' privacy. If there were more connections, you'd see that they all have different weights, and those weights are used in determining which way we send packets out; again, tinc is constantly polling those and keeping them current. You also get a subnet list, so you can see that spencer has a /25 and becaro has a /24, and this gives you an idea of which hosts and which subnets are available. This is great just as an operator: I need the status, run the command over here, then look.

It has a couple of other signals. USR1 and USR2 both dump informational state, and SIGINT, which is what you get when you hit Control-C, does not actually kill it: it increases logging verbosity to level 5, which is kind of cool. So if things are getting weird and you want to see what's going on, you hit it with an INT signal, you go read syslog for a while, and once you've solved the problem, you hit it with INT again and it goes back to normal logging. It's actually really, really nice; it's better than something like Apache, where you have to go through a whole reload-restart cycle just to pick up more debugging information.
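tinc does all this in C, of course; here's a toy shell stand-in just to show the signal mechanics (a background "daemon" that logs on USR1/USR2 the way the tinc manpage describes tincd dumping state to syslog):

```shell
# Toy daemon: traps the same signals tincd uses for state dumps.
log=/tmp/toy-tincd.log
: > "$log"
(
  trap 'echo "USR1: dumping connection list" >> /tmp/toy-tincd.log' USR1
  trap 'echo "USR2: dumping statistics"      >> /tmp/toy-tincd.log' USR2
  while :; do sleep 1; done
) &
pid=$!
sleep 1                      # let the traps get installed
kill -USR1 "$pid"; sleep 1   # like: kill -USR1 $(pidof tincd)
kill -USR2 "$pid"; sleep 1
kill "$pid" 2>/dev/null      # plain TERM stops the toy; remember INT won't stop tincd
cat "$log"                   # with the real tincd you'd read syslog instead
```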
However, if you're running tinc for the first time, and you've got it in the foreground, and you hit Control-C and you would like it to stop, and instead it gives you more logging, that's not the best experience.

It has two more control signals: SIGALRM and SIGHUP. SIGHUP does kind of what you would expect: it rereads the config files. So if you're bringing a new node in, HUP will have your machine reread keys and things, and a new node that's connecting to you will be allowed in right away. ALRM is like that, but better: it reloads all configuration and tries to make an outgoing connection to any host it doesn't have an established connection to. So you can add more ConnectTo lines, send SIGALRM, and it'll connect over to them.

I decided that this way of getting status was cute, but not DevOps-y. So I wrote a utility in Go that has an HTTP endpoint. You just run ./tincstat, and then when you curl it on port 9000, it sends USR2 to the daemon, grabs the log file, parses it, and dumps it out as JSON. From that, you have a programming interface that you can start to make really intelligent things with. Really all I wanted was status in my task bar, but, you know, go big or go home, right?

tinc right now is at 1.0, and a 1.1 release is on the horizon, which has a couple of things. It has tinc info, so you can just type tinc info with an IP address, or a subnet, or a name, and it will dump the relevant information. And of course, as it develops, we can figure out what we actually wanted to type after tinc info and make sure that's supported too.
Also, if anybody's used HAProxy: tinc took a page out of their book, and there will be a little control socket that you can netcat things to and get things back, and that'll be a much better interface than what I've got here.

Okay, how are people doing? Not totally lost? Awesome. So then we got to this phase, the Ian Malcolm "your scientists" phase. If you have your friends and you turn on all these services, once you have an overlay network with a flat IP space, there are a few things you can just immediately start doing. You can just run Apache. You can use UPnP streaming really easily; if anybody wants to VLC-stream a movie, they can, right? Oh, are you over there? I'll just stream it to you. Which is almost more fun than useful. Also, for anybody who still plays StarCraft, like me, you can play StarCraft over this, which is pretty sweet.

But, all right, if you remember, we had gobs of information. We knew that spencer's stuff is somewhere in this /25, so one of those 128 addresses is probably doing something, and one of these 256 addresses is probably doing something for becaro, but discovering that is super hard. So: DNS. Oh, I need DNS. I want to go to files.spencer.com or whatever. So we looked at what other people are doing with tinc, and we did the first obvious thing, which is that we started a hosts file. We started copying the hosts file from person to person and modifying it. And that actually worked really well for a little while. For those of you who were around in the early days of the internet, I'm sure it worked really well for a couple of years there too. While it's only being added to, it kind of works, right?
But once people start repurposing servers and IP addresses and turning services off, it becomes a nightmare: way out of date and inconsistent. So, okay. There was this thing they did after hosts files: they set up DNS servers. So we set up a DNS server. Somebody set up a BIND, and a couple of people pointed their resolv.conf at it. And that was kind of that, because when one person runs a BIND on their server, that doesn't help me at all. I can't push data into it; I don't know how to do zone updates on his server. So the server sat there for a while, got no utilization, and we were right back to host files. The BIND server didn't work.

So we stepped back a little and said: all right, we have multiple administrative domains, but we want shared mutable state. And more than that: obviously we could just write to a MySQL server, or even a DNS zone file or something, but that server is going to turn off randomly, because this is all just weird hacker gear and Raspberry Pis and stuff. So we wanted shared, mutable, highly available state. And we've already DevOpsed, so there's something that has that. We could do the hipster thing and set up etcd.

How many people know what etcd is? That's like a third of you. Okay. etcd is software from CoreOS. It is an implementation of the Raft consensus protocol, and it's largely a copy of ZooKeeper, or Chubby if you worked at Google. The way it was described to me, the reason they needed etcd is that when you start Docker containers, they don't have a writable root or a writable /etc. So if you can't write to the /etc filesystem, you can write to the daemon: it's the /etc daemon. I don't know if that's apocryphal or not, but it's a good way to think about it if you're not super familiar with it.
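The data model we ended up leaning on (which the talk comes back to in a minute) is shaped like a tiny filesystem of keys. Here's that shape mocked up with plain files, since the etcd hierarchy maps almost one-to-one onto a directory tree; the real invocations would be something like `etcdctl set /hosts/alice 10.11.1.1` and `etcdctl ls /hosts` (alice and bob are made-up names):

```shell
# Mock of etcd's hierarchical key-value store using a directory tree.
root=/tmp/etcd-shape
mkdir -p "$root/hosts"
echo 10.11.1.1 > "$root/hosts/alice"   # ~ etcdctl set /hosts/alice 10.11.1.1
echo 10.11.2.1 > "$root/hosts/bob"     # ~ etcdctl set /hosts/bob   10.11.2.1
ls "$root/hosts"                       # ~ etcdctl ls /hosts
cat "$root/hosts/alice"                # ~ etcdctl get /hosts/alice
```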
This class of software, along with Chubby and ZooKeeper and friends, is sometimes referred to as a distributed lock manager. On a local Unix system, we write a PID file, and that's kind of our lock: I'm already running this. That way your daemon can check the PID file before trying to reserve the port and getting kicked out. The distributed lock manager brings that to clusters. You can have multiple machines saying, hey, is somebody already running Nagios? Yep, somebody's already running Nagios. Cool, I don't need to run Nagios. Which is good, because I don't want to run Nagios.

When you look right at it, etcd is an HTTP interface to this weird Go daemon. It supports clusters of one, three, five, or nine nodes, and it replicates data between them. The Raft consensus algorithm gives you a strong guarantee that if a write is accepted, then that write is committed, and when you read, you also have strong consistency. And the cool thing is that it supports the idea that you can have multiple masters, and one of the masters can just disappear for a while, and the rest will just keep on chugging. They'll just keep on chugging, and if and when it comes back, it'll get the new data.

So we took etcd, which is, again, usually run by one administrative domain, and decided we'd use it as the shared way of sharing a dataset. When you look at it, it's ultimately a hierarchical key-value store. There's an ls command, so you can ls a directory and get keys out of it. Mostly I put strings in; if you dig into etcd, you can put in numbers, IP addresses, even counting locks like semaphores. And so we did the thing that kind of makes sense: we started writing hosts.
We made a hosts directory, and in hosts we write names, and under those names we put IP addresses. It's not DNS, but it's a shared way to write that kind of information down that's not just an Etherpad.

So then we decided to do something kind of stupid. This picture, by the way, I just love this picture. There's so much going on in this picture. Like, what are they doing up here? This isn't boxes; this is some kind of a tank with a compressor. Is this compressed gas? Because if it falls, it's going to burst and take out a wall. Okay, and what's he doing? Is he holding the forklift? Is that his job? It's fine, I got this, it's not going to fall over. Is this guy supervising? He's not holding any controls; he's just making the system more unstable. And if you follow all their eye-lines, it's all directed at the interesting part. But what's actually interesting is this tire. Because you look at this tire, and the ground's probably about here, and you realize: oh, this thing is in the process of tipping. Anyway, I would love to see the follow-up to this. I don't know what happened, I don't know where it happened, I just want to know.

So, we decided to do something stupid. How many people know about the file in /etc called nsswitch.conf? All right. When I'm teaching people Unix, this is where I go: yeah, let's talk about Unix. The name service switch is kind of the center of the computer, right? It's the idea that lets us take a number, the user ID, and associate it with a user name, so when we type ls, we can see that the owner is krum or something like that instead of just 111711. And it works for a couple of main name service groups. There's users, obviously, the association between numbers and user names. There's the association between groups and user names. There's the whole getent suite, right?
Your home directory, your office phone, your fax number, all that kind of good stuff. And there's also the original name service, right? The domain name service. So in your name service switch, there's a line that says "hosts:". We all know what comes after "hosts:". It's winbind, and, no, wait. It's "hosts:", and then the word "files", and then a space, and then the word "dns". And that basically means: first look at /etc/hosts, and then look in the domain name system, which probably means resolv.conf, and then answers pop out the other side.

Well, it turns out that you can put things in between files and dns. You can just write name service switch plugins. So one of our buddies, I don't know how much whiskey he had, busted out C and wrote libnss-etcd, which basically just shelled out to curl to hit etcd, return the correct name out of the key, and shove it into the hosts name service. Which was a thing that happened, and we sort of installed it, and it sort of just worked. It just worked really well.

So we set this up, and somebody packaged it for Arch, of course, because that's what you do. And then another friend, who worked at a company that will not be named, wanted it to be less gross. As it was, I didn't even know you could shell out in C; evidently you can. There's libcurl, which you can use instead of shelling out to curl, so by making the software like 200 lines longer, we started using an actual library and got way better HTTP handling. And this person felt that this contribution would be inappropriate to submit based on where he worked, and he didn't feel like he'd get approval, so he committed it as elvis.presley@example.com and sent it up. Okay, all right. And so our buddy, the Arch maintainer, decides to apply this patch to the Arch package, and types git am, or whatever it is, to apply it.
And you know what git 2.0 does? git 2.0 fails to apply the patch, because example.com is not allowed per some RFC. So that was fun. The story isn't going anywhere; that's it.

So DNS is solved. All right, cool. We have DNS. We have a weird pipeline that lets you write keys into this shared dataset, and a thing that lets your computer pick them up on the other side. So if you know a name, you can go get it, and you can look at each other's cat gifs. Solid.

Then we find this thing that's way more hip than even etcd, which is Consul, by HashiCorp. And it's really good software, sort of. It's similar; it's got a lot of the same things. It's still the Raft consensus protocol. It's still a Go daemon that you run in three or four different places; it still has that one, three, five, maybe seven or nine cluster sizing. And it has this idea of gossiping the information around and being able to refuse a write that it doesn't like. So it's very much the same as what we were doing with etcd. However, this thing was optimized for service discovery in large clusters, and so there's a simpler way to get information out of it, which is DNS. It has a built-in DNS server, so you can dig at a Consul instance with a name, and it will return an IP address. Actually, it'll return a round-robin answer over all the things it thinks provide that service. And, to make it more complicated, they added a kind of Nagios-check system, kind of like HAProxy's health checks. You register, say, four things that supposedly respond for a service, and it will, in a tight loop, poll them and make sure they're all there. If any of them actually go away and start failing the checks, it pulls them out of the round-robin, so the answer you get when you make a DNS request will change. So this is cool. It speaks DNS, but not on port 53; it's 8600 or something like that.
So you end up having to set up a local resolver, a dnsmasq, inside your system to use it, and it's kind of complicated. But we tried it out for a little while, and it turned out it worked. However, there was an issue with it, which is that it doesn't have discoverability of the discoverables. With etcd, there was a simple filesystem hierarchy; it was an abstraction layer, but it was still there. You could type ls --recursive and you would see every answer that was possibly there, and our dataset was small enough that that was useful. With Consul, you had to know a priori what you were going to ask for. It was very good at responding to it, on a protocol that made more sense, but you didn't know what to look for, and that posed a problem. It also posed the problem that part of the point is to get away from HTTP, right? So the information we want to convey is not really just a name; it's a tuple: a name and a port, maybe even the name of the service you should be speaking on that port, if /etc/services is still working. So we ran Consul for a little while, and we ended up ripping it out. I think the etcd implementation was actually better, which is surprising, because it was pretty bad.

Okay, so I want to show you some demo stuff. Okay. Yeah, it's going to work. If we look at ifconfig for the example network, we can see what the interface looks like. This is the TUN/TAP device that tinc creates when it comes up. This hardware address looks totally legit; it's an unspecified link encapsulation, because it's just software, right? You write a packet on this interface; the software picks it up and writes it out somewhere else. If we ping, we get these pings back from another node that is on the network, which on its own is really exciting, because that's conference Wi-Fi for you. And if we do an mtr to that same IP address, or the same name, we can see that it's just picking it up and dropping it off.
It doesn't matter how many hosts away it is; from a network perspective, it's directly there. It's also worth noting that tinc can be configured in a layer two or a layer three configuration. We've configured it in a layer three configuration, which means that there isn't really an ARP to spoof. It's not like you can trust IP addresses yet, but you can't attack it through ARP spoofing. So with that set up, what we can look at is protocols. The key advantage here is we have a flat network that's secure enough that we can use all these protocols that have been in the dustbin for 20 years, which is kind of sweet. So, does anybody know what that command does? That's NFS, the network file system. Google Drive can't touch this. I don't know if people remember their university experience, but there was this awesome thing called the automounter. And it lets you do stuff like this, where you just cd into a directory that does not exist. Wait a second, it auto-NFS-mounts, and my prompt shows you that we're on an NFS directory from spencer.telescope. And now we have this. How big is this directory? Oh, we have three terabytes. Awesome. That's the best. We can run that command, which is a full-screen video played over the conference Wi-Fi, over NFS, over the public internet. And the mplayer flags are like, I need a cache, please. That's what that's all about. So this is another thing you can do, though: the computer has the ability to extend storage from a remote source. Like we said, this is now a computer with more storage. So you could make a user, an NFS user, whose home dir is mounted over NFS. So the home directory of this user is remote, and this directory is here. I haven't actually tried this yet, but you could have a laptop that apparently has a 3.6 terabyte home directory, because the home directory is mounted over NFS.
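The automounter trick is just configuration. A minimal sketch, assuming autofs and a hypothetical NFS server named telescope exporting /export/home (all names and paths here are placeholders, not the speaker's actual setup):

```conf
# /etc/auto.master -- hand the /remote tree to the automounter
/remote  /etc/auto.remote  --timeout=60

# /etc/auto.remote -- mount home over NFS on first access;
# "telescope" and the export path are placeholders
home  -fstype=nfs,rw,soft  telescope:/export/home
```

A `cd /remote/home` then triggers the mount on demand, which is exactly the disappearing-directory effect in the demo.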
There's no local state for this user on this machine. It's all network traffic. And, of course, tinc is set up as a system-level process, so the user is not required to type a password or anything. It's all RSA keys. You could actually extend this kind of pattern to make sort of thin-client laptops that just connect back to some mother base. You can bring out protocols that you haven't been able to use in a while. Like finger — it's a command, for those of you who weren't around when this was a thing. Not that I was, just that I got excited about it. In the past, we all had logins on the one big Unix machine. And instead of doing a stand-up, you would write a file called .plan in your home directory. And if your manager wanted to know what you were working on, they would type finger nfsuser, and it could also tell them how many mails you hadn't read and stuff. So that's pretty cool. I also want to show you what etcdctl ends up looking like. So it's just an ls. There's a lot of configuration in the background that you would discover if you tried to follow me in. But it's all setting of environment variables and URLs and stuff — just giving it a tiny amount of bootstrapping information. etcdctl ultimately needs to know where to go get etcd, but that's one piece, one string that you can memorize or socialize or something. And so you can ls in /hosts. Thank you. And there's the two hosts that actually exist on this example network, and some fake hosts that I created for fun. There's no tab completion, which is kind of annoying. And you can get an IP address out that way. You can set an IP address really easily, too — you can set any key, really. Of course, that's just writing a string into a key. There's no validation, so I could do stuff that doesn't make sense. I don't know if everybody can see that.
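That bootstrapping amounts to roughly the following; this is a sketch using the v2-era etcdctl conventions, and the address and key names are placeholders, not the speaker's actual configuration:

```shell
# the one string etcdctl needs: where etcd lives
export ETCDCTL_PEERS=http://10.10.0.1:2379

etcdctl ls /hosts                       # list the registered hosts
etcdctl get /hosts/spencer.telescope    # read an IP back out
etcdctl set /hosts/fakehost 10.10.0.99  # any string goes in; nothing validates it
```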
And you get a little response. And etcdctl is the easiest way to talk to etcd, for sure. But it's actually just a curl. So if you spent a little bit of work on it, you could implement it in Bash with curl. Or at this point, etcd is popular enough software that there's libraries for Python and Ruby and Brainfuck and everything. There's literally no requirement there. Cool. Any questions on the demo? Yeah. So there was a question of, does tinc manage those files remotely? Oh — etcdctl. Is that how they pronounce it sometimes? etcdctl. So your question is what? Oh, the question is, I hadn't seen etcdctl before — have you used etcd without it? Oh, okay, yeah. So etcd is just running over here. That's etcd. It's just running in the foreground. And then etcdctl is just a command that basically makes a couple of curl requests. If you look at the help, it's pretty straightforward. It's just ls, get, set, rm, mkdir. I really like the file system feel, because as a Unix file system person, that's the kind of stuff I like. It's not perfect. If you're actually used to the shell, some of these will feel a little clunky. But that's better than a mystery database that I can't scan. I'll also show you some of the config files. I'll just go ahead and revoke all the keys after we're done with this, so I can show them to you. Safe to say there's an RSA private key. There's an RSA public key, which looks like that. I don't think there's a lot of surprise there. Alongside the hosts directory, there's tinc.conf. And that says, essentially: here's some information. We're going to use IPv4. What my name is — so on that graph where there were numbers, the name would show up there. We're going to connect to becaro. We're going to connect to spencer.telescope. And the graph dump file is that thing I showed you that will dump the dot file so you can visualize your graph.
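To make the "it's just a curl" point concrete, here are the same operations straight against etcd's v2 keys API; the endpoint address and key names are placeholders:

```shell
curl http://10.10.0.1:2379/v2/keys/hosts                     # ls
curl http://10.10.0.1:2379/v2/keys/hosts/spencer.telescope   # get
curl -X PUT -d value=10.10.0.99 \
     http://10.10.0.1:2379/v2/keys/hosts/fakehost            # set
```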
It's interesting that tinc.conf is where you specify who to connect to, but you do not specify in this file who can connect to you. You specify that in a completely different way, because that makes sense. So in your hosts directory, you have host files. And if we cat hosts/spencer.telescope — that file is what lets spencer.telescope connect to me. That's the key, the subnet, the port and the address. And this does two things. One, it provides the information that I would use to connect to spencer.telescope, and the key allows me to authenticate them when they connect in. And because this file is here, they're allowed to connect to me. Which — I would rather have a config line in tinc.conf that says these are the things that are allowed to connect to me, because this is kind of weird action at a distance. Another thing that's kind of confusing is tinc-up. tinc-up is just kind of an entry point script, if that makes any sense. Tinc will run tinc-up after configuring the network. And that's where the actual IP address that your node should attain is set. And you set the netmask to 255.255.0.0 — note those zeros right there. Because again, we're taking an entire slash 16, and that's the whole network. And there's no routing, really. Okay. Any questions on any of that? Yeah. The question is, can you integrate that into /etc/network/interfaces? And the answer is you don't have to. Because in /etc/tinc, there's a nets.boot, and this file contains the names of all the networks that will start on boot. And tinc-up is always fired right after tinc comes up. So one thing that kind of makes sense: if you want to provide services onto the tinc network, you can put the startup of your services in tinc-up, because it's just a shell script — it's not only ifconfig lines. So that's what kind of makes sense.
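Putting those pieces together, here's a hedged sketch of the layout for a network called example — the node names, addresses and subnets are placeholders, but the file names (tinc.conf, the host files, tinc-up, nets.boot) are tinc's own conventions:

```conf
# /etc/tinc/example/tinc.conf -- who we are and who we dial out to
Name = my_laptop
AddressFamily = ipv4
ConnectTo = spencer_telescope
GraphDumpFile = /tmp/example.dot

# /etc/tinc/example/hosts/spencer_telescope -- its presence is what
# authorizes that node to connect to us
Address = telescope.example.org
Port = 655
Subnet = 10.10.0.2/32
# ...followed by that node's RSA public key block

# /etc/tinc/example/tinc-up -- a shell script run once the tun device
# exists; the whole flat network is one /16
#!/bin/sh
ifconfig $INTERFACE 10.10.0.1 netmask 255.255.0.0

# /etc/tinc/nets.boot -- one network name per line, started at boot
example
```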
And HUP it, so that if it's listening on all interfaces, it grabs the new interface. In the front? The question is, if A connects to B and B connects to C, can they talk? So this graph probably shows that. If 1 connects to 9, and 9 connects to 6, can 6 and 1 talk? The answer is 100% yes. All nodes can talk to all other nodes seamlessly. Yeah. Okay. So it was pointed out that tinc has really good IPv6 support, both inside and outside, and that the transport layer that the overlay network is connected via can be done over TCP or UDP. Yeah. Is that? That's perfect. Okay. Yes. Although in this environment, for all we know, it could be going 5, 2, 6. I guess from 1, it would have to go through 9. But yeah. And I don't actually have a good sense of what commands I would use to actually interrogate that. Yeah. Yes. So the question is, does IPv6 work on the underlay network and the overlay network? And I think the answer is yes. Over there? So it was kind of a statement: you can run tinc in Ethernet mode, and then overlay whatever layer 3 technology you want and use Quagga or something like that for routing. Yeah. That wasn't a question? Okay. Yes? Okay. So the question was, it has no routing, so how does it decide what path to take? And that was one of those things where I could have chosen better words. It definitely has a concept of which subnets are where, and a sense of routing at this layer, right? Because the layer we're looking at right now is tinc daemons talking to other tinc daemons. However, for the IP packets that you're writing and shipping out over these things, as a user, you have no routing information. There are no routers that will respond to ICMP; nothing will decrement a TTL. That's what I meant when I said there's no routing. Yeah. So the question is, if we're going from 9 to 2, is it going to go 9, 6, 2 or 9, 5, 2?
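The layer two versus layer three choice mentioned above comes down to a single directive in tinc.conf; a sketch:

```conf
# /etc/tinc/example/tinc.conf
# router: layer 3 -- tinc tracks Subnet declarations itself, and there
# is no ARP on the wire to spoof (this is the mode in the demo)
Mode = router

# switch: layer 2 -- tinc acts like an Ethernet switch, and you can run
# your own routing protocols (BGP via Quagga, etc.) on top
#Mode = switch
```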
And the answer is the daemon itself makes a decision, and it's constantly probing to figure out which path is the most optimal. And I have no insight into the smartness or naivete of those algorithms. In the back — sorry, you've been up for a while. So the question is, are there controls to set up routing, to set priority, to set cost on routing? I think there's a command to set a cost on a path, but I'm not sure; I've not done that. Again, this is hobbyist stuff for me, so I'm not deep into it. Maybe one of the other experts would like to raise their hand and answer that one. Is there a question there in the green? So the question was, can you modify weights? And yeah, I would refer you to the documentation on that one. I believe you can put a cost on a link, but I don't know — this is getting to the advanced stages that I don't have knowledge of. In the black? Yeah, so the suggestion was that if you want to run it in a layer two mode, then you can BGP to your heart's content, setting costs and timeouts and things. And the gray? So the question is, if you wanted to publish your RSA public key, could you shove that into etcd? Well, etcd would obviously accept the write. However, the etcd cluster, as we've configured it — which again is kind of a derpy configuration — depends on you being on the network in order to read it. So yes and no. There's an advantage in the sense that you only need one connection to the network to be on the network at all, but you're stronger with more connections. So if you get one node's RSA key from a friend, and then you join the network, and then you scan etcd to get the rest of those configurations, that's probably a good configuration. Yeah, so the question is, can you do something kind of like a split tunnel or a full tunnel with tinc, where your primary connectivity to the internet is going over tinc?
And the answer is I don't know. Several of us — because we're a bunch of friends, which is almost like the word enemies, right? — more than one of us runs it on the home router that's running Linux. And they're like, oh, I see you have IPv4 forwarding set. So they try to get packets going into the tinc network and out to websites through me, to get Comcast out of me. Nobody has as of yet, in my configuration, been successful at doing that, even with someone's help. So we've not actually gotten it working. Is that because we're not as good at networking as we think we are? Probably. In the back. That's amazing. So there was a statement that there's a giant free software basement OpenStack tinc thing, which is probably my jam. Okay, so I wasn't actually done with the talk. I appreciate all the questions, though. Okay, so there's some neat tricks you can do with this. So laptops. Perhaps the most important thing is it's a flattening of the network, right? So laptops and things that typically are second-class citizens, especially when NAT is involved, all have known IP addresses, which means that at any time, as long as I'm on the tinc network, my buddy can ping my laptop. Now it's an exercise for the user to figure out something useful to do with that, but it's very useful. It's also useful in corporate networks, because I work for huge companies, right? IBM is a huge company. And it's very common for them to have kind of antiquated networking stacks or networking security models, where they'll have completely isolated networks. And your business unit is given three machines in this one and three machines in that one and four machines over here, and you would like to run some services. So by using tinc and an overlay network, you can make them look like they're all on the same network.
One of the best things somebody did with this is they made a backup device. They got a Linux machine, put four or five hard drives in it, just a case, put tinc on it, and shipped it to their grandparents. And they said, grandpa, just plug it in. Grandpa plugs it into the router, and then it DHCPs from grandpa's router, connects up to the well-known tinc IP addresses, and now it has a known IP address and can just start being disks that grandpa won't lose. And now it's just like a compute node or a storage node. And you don't have to mess with ssh -R. You don't have to say, grandpa, go to whatsmyipaddress.com. You don't have to say, I need you to press these buttons on the Linksys to do the port forwarding. It just works. So Brood War over tinc is, of course, one of the most important things. It's interesting because — and I'm not 100% confident in how this works — because your SSH connections that go over tinc aren't really bare IP packets anymore, when you close your laptop and later open it, sometimes they live. They live for days sometimes. Whatever process is involved in actually issuing the TCP resets that cause a broken pipe doesn't happen. One of our nodes is trans-Pacific; it's in Japan. So we had a lot of fun setting etcd's don't-time-out-for-a-long-time flags. So NFS — I showed you a little bit of the NFS home for the laptop user. Crazy people have run X11 over tinc, where you let your machine be a free target and then your buddy who's also on tinc X-forwards things over to your machine. Just for fun. So what's next for me? Now that we have kind of a service index of all these HTTP and NFS resources, of files people are providing to the network, I want to write a daemon that scans those and puts them in something like a Xapian index, so we can search for which files are available on the network. Tinc 1.1 is coming out any day now. Guus is working on it.
I know he takes donations. Sounds like there's a lot of buzz in the room, a lot of excitement for tinc. So 1.1 is something you can help with: you can try the beta, you can donate to the project. Development — I think I'm going to try... I'm not really a C programmer, but I am a CI guy. So I'm probably going to see if I can get some kind of CI pipeline going, a basic does-it-compile pre-flight check, at least on GitHub pull requests. So yeah, conclusions. You can build an overlay network. It's one of those things that is a part, not a solution, so you have to figure out how to apply this technology to your problem space. Consul and etcd are pretty robust, if a little obtuse. StarCraft is an excellent game. And I think this is going to be a pattern for my work in the future: take something old and reliable like tinc, dust it off, combine it with something new and hip like etcd, and see what kind of emergent behaviors come out of that, because certainly we don't have to reinvent every technology. Thank you. Are there any more questions? I know we already kind of did Q&A. Yes. Interesting. So yeah, we haven't really done scale testing. At my scale, I don't have that problem. Obviously, RAM is a consideration with all of those open network connections at some point, but probably well before that... One of the problems with routing — if you think about high-end gear and how it does routing, it keeps the entire routing table for the entire internet in memory. So well before you ran into other problems, you might run into problems where the amount of memory you're consuming just keeping the state of the graph in memory is hard. Thanks, everybody. What's your name? Nice to meet you. Yeah, I'm on the list very occasionally. It's actually a pretty active list, considering how sort of... yeah, dusty it seems. Yeah. Basically like a peer-to-peer... oh, cool, Dropbox solution. That's awesome. Yeah.
Do you have performance testing? It says about a 15% performance penalty at the very beginning. Yeah, so when I was at... How do I turn this off? Test. All right, good to go. Yes. You got it, Jason. All right, thank you. Thank you. All your friends. What's your name again? Jason. Jason. So when something... There you go. There we go. Accept. Look at that. I have — probably because it's the PDF version. Oh, yeah. There you go. All right. All right, thank you so much. All right, guys. Thank you for your patience. Technical difficulties are always fun. So we're going to continue this track. Amanda Folson is going to be talking about how you can use open source — or participate in open source — even if you're a closed source company. So without further ado, let's get started. Thank you. Sorry for the delay. So, fun stuff — we're going to run this off of two laptops. How many people does it take to give a presentation? How many laptops does it take? Pull that away so I'm not breathing into it. All right. So just to get started, who am I? My name is Amanda Folson. I'm a developer evangelist at PagerDuty. That's a fancy way of saying that I am a professional conference attendee. I pretty much go around to things like this — yeah, basically — and give talks just like this. My job is to interface with people, write documentation, and things like that. I grew up around computers. My dad was a developer in the 90s, so I grew up around Visual Basic and even some C# in the early years, things of that nature. I built my first computer when I was around seven, with some help from my dad, of course. So I've always kind of been into the computer thing. I've always been tinkering with things, so I would say that I am also a professional tinkerer. That's also a huge facet of my job. In middle school, I was making websites, as you do.
I was on Neopets and a few other things and learned some HTML, and I was like, you know, I really want to make my own copy of this. So I had some problems with Neopets. Nine-year-old me was like, this is stupid. Why does this work this way? How can I make my own? So I figured that out, found PHP, made my Neopets clone. Funnily enough, I still have the source code for that. It is the most awful thing in the world, but it still exists, so it's kind of cool. I really kind of owe my career to open source software, too. I continued doing PHP, eventually found IRC channels, which is great. Got involved in some IRC development and worked on a project called Anope. If you're not familiar, IRC is Internet Relay Chat, an old-school chat system, and Anope is a set of IRC services. It allows you to do things like register channels, register your nickname, all of that fun stuff. I started hacking on that and found some weird issues, particularly a MySQL issue that I ran into. And I was like, hey, you know, this is broken for me. Can I submit a patch for it? And they were elated that anybody in the world would want to submit a patch to this thing. It was really cool. And eventually they gave me commit access. Oh God. No. Okay. This is going to be a fun one. Yeah. So I started making changes, got commit access after submitting several, several patches to this thing, and was just kind of able to do whatever I wanted. Fixed a MySQL issue that was causing a crash. Turns out it had been causing a crash for a lot of other people; just no one was reporting this bug. So I found it, I fixed it, it was great. And from then on, I was kind of addicted to open source communities, and kind of explored, you know, hosting websites and things like that. And eventually wound up exploring a whole bunch of other technology. I needed a web server to host my crappy Neopets clone. And eventually got a job as a tech writer, which exposed me to even more stuff.
So I've kind of been all over the place; I've been doing this open source software thing for a while. At PagerDuty, I work with Jekyll, I work with Ruby, I do some JavaScript stuff, even some PHP stuff. And I simply wouldn't be here today without some awesome open source software. Forgive me while I change this in two places. So what is open source? There's actually a lot of confusion around what open source is and what it's not. People think, oh, open source is just giving it all away for free, and that's simply not true. There's a ton of licenses. Releasing something as open source means releasing it under a license that, for the greater good, allows people to adapt and re-release your software. The Open Source Initiative and GitHub break down the licensing thing really well. I'm not going to get into that, because it's just a crazy subject and could realistically be its own talk. And you're sure to find a license that will work for you. Even if you're worried about IP law, there are licenses that cover intellectual property, so that you retain ownership of your source code without giving away all of your ideas. And licensing is super, super important if you want other people to use your code. Whether that's including your project in their own or building on top of it or whatever — without one, they technically can't use your software without your permission, even if the source code is put up on GitHub. So licensing: super, super important when it comes to open source software. Putting it all out there, like I said, allows people to adapt and re-release your software. People will do bug fixes. They will fork it, add new features, do their own releases. I don't know if any of you follow the Node project, but that kind of split and then re-merged. That was a big deal. So things like that happen all the time, and wouldn't be possible without, well, licensing and releasing the source code.
And all of this can be done at a global scale. So if you open source your project, you're getting access to literally millions of people. I'm not saying millions of people are gonna work on your project, of course, but you might get a few thousand. So you can go from five people working in-house on something to 5,000 people. That's actually not that unrealistic. Some of these larger projects have thousands and thousands of committers and thousands and thousands of commits every month. Some people are doing this in their spare time. Some people are hired to work on open source full-time — you can absolutely make a career out of it. Other times it's just people using a tool who want to make changes upstream. And honestly, very few companies can afford to hire the diversity of opinions you will get on these projects. Very few companies can afford to hire 5,000 developers. But by open sourcing something, you kind of get that for free, with very little overhead on your part. Come on — there we go. So what open source isn't: it's not giving it all away for nothing. You're giving it away in exchange for out-of-house development. And that affords you so many awesome things, which I will get to in a second. Some people think that giving it away is giving away your bread and butter, and your company's gonna go bankrupt, and all these things. And that's really not true. Anybody familiar with OpenStack in here? Okay, a few people. So that's a relatively large project. Rackspace started it and was basically like, here you go. And it runs part of their infrastructure. And they're still a billion-dollar company, despite releasing that. You could basically spin up your own version of Rackspace if you wanted to using this utility, and they don't care. They're like, have it. And they've actually benefited from people taking it over and adding new features, and it has become this huge, huge ecosystem. Parts of Windows and Azure are open source.
Microsoft is still fine. They're still doing all right. Apple has started releasing things — they've released Swift. They're still doing okay. So granted, they are a hardware company, but you can release various aspects of things and not go bankrupt. So like I said, you're exchanging the source code for rapid development. And this is especially true at larger companies. If you're at a company that's got some serious firepower, you're a well-known name and you release an open source project, you're guaranteed to get a few hundred people interested in it immediately. Thousands of commits per month on some of these projects. And you can't always buy that. Maybe you can if you're Microsoft, but if you're at a smaller company, you can't really buy that kind of publicity. So you're giving it away to someone who can use it and modify it to suit their needs. That's huge. That's key. Letting people scratch their own itch, so you don't have to. So essentially, you're getting people to come work on this thing for free. I like the scratching gesture. And you're not giving away your business. There's so much more to a business than the source code you release, of course. There's marketing, there's sales — how you market your thing is super, super important. So never underestimate the power of lazy. Lots of people can't and won't self-host projects like this, especially if it's a project that's really hard to set up. Atlassian makes a ton of money hosting Jira and Confluence and things like that, because people simply don't want to put in the effort to host it themselves. And that's huge. So never underestimate that. So I talk to people and they're like, oh, well, we can't do that. And my first response is, well, why not? They're all like, oh, my gosh, we're going to lose control. We're going to go out of business. All of these bad things will happen. Our competitors are going to steal our stuff and we're just going to go bankrupt. And that's really not usually true.
So I'll give you an example. At PagerDuty, we have this messaging pipeline. It's pretty robust, it's very reliable, and it's kind of our secret sauce. We could realistically release our web panel — nothing super secret about that. So there are times when you do want to keep things closed source, just to protect your trade secrets, the things that competitors absolutely would benefit from if you released them. If we were to release our web panel, it'd be kind of great, because it would allow people to fix our pain points. But if someone wanted to steal that — if they really wanted to steal our front end — first of all, they kind of can anyway, because it's HTML and JavaScript. But also, they'd have to build the entire back end themselves, and most people are really not going to be bothered with that. If there's an existing tool that works and is cheaper than putting in the dev time, they're just going to roll with that. Guaranteed. So I talk to a ton of people and they're still not convinced. They're like, oh well, there's no value to open source software. And all I can really do is shake my head. They're typically telling me these things from their Windows PC or their MacBook, both of which contain open source libraries, by the way. Or they say it on Facebook and Twitter, which are making use of PHP, Scala, Ruby, Apache, Nginx, all those things. I actually saw someone blogging on a WordPress site about how open source software was not valuable. And I just wanted to be like, we need to have a little talk. You are benefiting immensely. You are able to spread your opinions by virtue of open source. And if you hear this, I encourage you to challenge people on it. Call them out on it, because I think we have a lot of work to do in terms of telling people how prevalent open source software is, just as an aside. Your smart TVs — some of them are running Android. Cell phones, too.
People are making use of TCP/IP libraries — nobody wants to write that crap themselves anymore. Same with HTTP. There's all sorts of things that everyday devices are taking advantage of. I absolutely encourage you to call people out on that. Another point: we'd have to hire someone to maintain it. Well, so what? It's still cheaper than multiple in-house engineers working on something, or contractors, or anything like that. If you're open sourcing it, you have access, in theory, to millions of people. There are millions of potential developers that could hop on this project. And of course, if you build it, they're not necessarily going to come to it overnight. You might have to put in some marketing. You'll have to get it off the ground. You'll have to nurture the community that evolves around it. But for the most part, if you have to hire one person to gain access to literally thousands of people who can potentially work on it, isn't that worth it? Yeah, absolutely. And it's not true that you will always have to hire someone to maintain something for you. You might find that the volunteers will do the heavy lifting for you. And this is true of a lot of the projects that Red Hat runs. There's a ton of volunteers that are just committing code. Some of them are Red Hat employees, but a lot of them aren't. So they kind of manage the direction and do a lot of the feature implementation stuff so that you don't have to, and that's awesome. But it's not uncommon to see community managers or developer advocates or someone in a position like this, basically to guide the community and guide the project in the direction that it should be going, but also to interface with the community and bring that back within the company and say, hey, the community thinks we need to go in this direction. Maybe we should try that.
Maybe we should listen to them, because they're probably in a better position to tell us that than we are to tell them what they need to be doing. So you just provide the community with the guidance and take it from there. You don't necessarily need to hire somebody. So if anybody's like, oh, well, we can't do open source because then we have to maintain it — well, yeah, but also no, not so much. And then my favorite: but what about my bottom line? People love their money. They want to hold on to their money. And this is probably the number one thing that I hear about when I'm talking about open source, and in particular when giving this talk. People are afraid of losing money, which is absolutely healthy. You should be worried about losing money. They want to know that they'll stay in business. Also a pretty healthy opinion to have, right? But if your idea is good, people are going to build on it. The best part is, even if they build on your idea and don't release the source, you still might benefit from that. You can see what the market wants through the work of others. Immense value in that, immense value. You might run into some leechers — people who just want to take your code and mess around with it and not actually give back, or maybe they want to rip your code off and use it as their own. But they're really not that successful with it. I don't know a whole lot of stolen-code projects that surpassed the parent project they were forked from. You don't really hear about that a whole lot. So worrying about leechers is not a complete waste of your time, but if that's your main concern, I'm here to tell you, it's probably not going to be that big of an issue. You'll see them modify it.
My favorite is when people, they will fork a project and then mass delete all of the headers with the licensing information, but all of that history goes into GitHub so you can actually see them deleting the comments and the copyrights and leaving that history behind. I love it when people do that. It's like, are you really? So yeah, most of the time they're not successful. Their project is likely not going to make waves that are bigger than yours. Going back to Rackspace, they're selling you an experience. There's so much more to the product than the code. They're selling you their quote-unquote fanatical support. Everybody's seen those ads, where it's like some dude that's like, I've been a Racker for 16 years and I provide fanatical support. Like that's really what they're trying to sell you. It's like, the infrastructure is important, but at the end of the day, it's a server in a rack somewhere. Anybody can kind of do that, really. Well, mostly. So even if someone takes your code and creates a competing business, there's actually no guarantee that they can provide the same experience that you can. Experience is huge. People are an important factor in this. It's not just the source code. So if you're worried about your bottom line, there's so much more involved in worrying about your bottom line when it comes to this stuff than just releasing your source code and assuming that someone's just going to steal it and put you out of business. Someone will steal my idea. Well, I've got news for you. I can't tell you how many times I've heard this from people with just ideas that are really not that original, not that good. Your idea is probably not that unique. If you're creating a cloud on OpenStack, you're joining literally thousands of other people who are doing the same exact thing. Some control panel stuff might be different. At the end of the day, like I said, it's just a server in a rack somewhere serving some stuff that's not all that unique.
Facebook isn't unique. Social networks were around long before Facebook was in the picture. That's been a thing pretty much since the internet. Photo sharing was a thing before Instagram. Search engines existed long before Google was ever a thing. So never underestimate the power of marketing and positioning when it comes to open-source software. And I think Red Hat does a pretty good job of this. You know, they have lots of money to throw at the marketing and development of these things. And it works for them. It's like you can pretty much get anything that Red Hat does for free unless it's support. But yeah, like it's free. If you want to set it up, you absolutely can. But if you don't have the time or the patience, they have people that will help you do it. Another spoiler: nobody really cares about your idea as much as you think. If they did, they'd be out there building it themselves. And it happens from time to time. But in the end, one project usually pulls ahead of another. A good example of this: Linux versus Hurd. Who do you think is winning that? And value is very subjective. What you value may not actually be all that valuable to someone else. So if you're worried about someone stealing your idea, first of all, your idea is probably not that unique. Second of all, you know, by putting your idea out there, you might get some opinions on it that you didn't even think of. Someone might say, like, hey, well, why don't you try going down this route? Or try going this way. Maybe you need to implement this feature. And you just never know what's going to come out of that. So in addition to code, you know, the bug fixes and things, you get some other benefits by releasing your software out there. People can get in. They can get answers. They can get on with their life. This is especially true of technical people. They don't want to talk to sales engineers or anything like that, not interested in traditional sales and marketing.
So in my experience as a developer evangelist, you know, you run into people who just want to go in. They want to tinker with it. You give them documentation and they just want to go do their thing. Most people who are playing with this sort of thing, they're really not all that interested in getting your marketing pitch or anything like that. If people have some visibility into your product, you're opening yourself up to more feedback. So one of the things that we do, which I will cover in a little bit more detail, is we invite people to give feedback rigorously. We send them emails. Anytime someone creates an API key, I email them and I'm like, hey, what are you building? You know, do you have any feedback for us? And the email campaign in particular has a 20% response rate, which in this industry, where pretty much anything above 3% is phenomenal, is huge. But if you're building in some transparency, people will love to give you feedback and some insight into how they're using your product and what they're doing, what they hope it does, what it's not doing that's hurting them, things like that. And you win either way. If you make a sale, great. If you don't make a sale, hopefully you've got some really good feedback from that interaction. And you know, as a collective, the technology community is pretty vocal about what we like and what we don't like, especially what we don't like. So you're extremely likely to get the holes in your project figured out as you're talking to people, because they're going to be like, well, it doesn't do X, Y, and Z. I really need X, Y, and Z because of A, B, and C. And you know, that feedback is super, super valuable. But if, you know, they're in there tinkering and they see that it works for them, then you might make a sale. And you might make a sale through some nontraditional sales funnel. You know, like someone just might be like, okay, well, you know, I see I can get this information out of your API.
I can put this information into your product via this API. That's good enough for me. I'm going to sign up. I'm going to do my thing. Leave me alone. That's great. So when we know what to expect, we're a lot happier as technologists. You know, I don't think everybody's first instinct is to go to the documentation, but definitely if we're tinkering with something and we can kind of pull bits and pieces out of it that we like, we're typically a lot happier. Another important thing when talking about open source is that one size doesn't necessarily fit all. It works for you; it's not necessarily going to work for someone else. As one person, you know, OpenStack is complete overkill for me. Maybe I just need a Xen box, or maybe I just need something like Vagrant. Maybe I just need a couple of Vagrant boxes to spin up. You know, I have the freedom to explore all of my options, which is great. As a result, I know more about these products, and especially in my line of work, exposure to different things is key, and I can help other people find solutions for themselves. So it's a very positive thing to have software diversity here. If the source is open, I also have the freedom to tweak it, which is how I really got into this. You know, I was playing around on IRC, something was not working for me, and I was like, okay, well, you know, what's wrong? Did some stack traces, got into Valgrind and all of that fun stuff, and was like, okay, well, you know, here's the fix. I'm good to go. So I started making modules for functionality that I needed, and now the users of the software, since these modules got incorporated into the core, you know, they're all benefiting from those changes too. So there was the MySQL thing. Like I said, they didn't know that that was an issue. People were not reporting that bug. Turns out once that fix was released, then suddenly a whole bunch of people were like, oh my gosh, thank you so much for fixing that.
I was, that was driving me nuts for months. I've been trying to get it to work, didn't understand why it wasn't working. So it was great and they should be able to tweak it. You know, their changes might even benefit you. I can't even tell you like, how many things that I've touched that get continuous updates that are then better for me. So like I said, one size might not fit all. It might fit many and you might find that out just by releasing your software and having people tweak it. And if it'll fit many more, a few users can establish quorum on what they need. So you can get people talking. We have a community site. Some people have gone back and forth on it and they kind of reach a consensus on some of the prioritization and the features that they need within our product, which is really cool. So the kind of the community and the customer base gets to decide really what we're doing, which is kind of cool. And some of the changes we roll out, they might only benefit one person or they might benefit thousands of customers. You just kind of never know with that sort of thing. And another benefit that I think is really overlooked is information transparency. So too often we're stuck in our own silos and develop like these huge blind spots in our product. So we're so focused on one thing that we kind of get tunnel vision and we don't necessarily notice the holes in our product. So trying to stay ahead of the market can result in some serious tunnel vision. So like we know what we want, but we don't always know what customers want, even if they give us feedback, are we acting on that? So letting the customers have control over our product direction is actually an extremely good thing. They often know more than you about their own needs. So letting them have control is kind of giving them what they need in order to succeed. 
So as you're trying to apply some of this open source stuff, just be mindful that even if you're leading a project, you don't necessarily know what's the best thing for that project. You really need to listen to your customers and figure out what they want from your project. And besides, sooner or later, someone will make an open source alternative to your product. Even if you make a good product and keep the source closed, someone's going to be like, you know what, I don't like that this is closed source. I'm just going to do it myself. That's kind of the world we live in these days, and it's great. OpenOffice and LibreOffice are great examples of this. Microsoft Office is pretty much, or was pretty much, essential for business needs. You had to have Excel. You had to have PowerPoint. Had to have Microsoft Word. Even in college, they're like, oh, well, you have to submit your assignment as a DOCX. We're not going to accept anything else, no PDFs or anything like that. And there are open source social networks. There's an open source Facebook. I feel like open source Twitter APIs are becoming more common. So I deal a lot with API design and give a lot of talks on that. And almost all of the examples I see these days are either like a Foursquare clone or a Twitter clone. So there are even open source alternatives to those. If you want to set up like Foursquare for just your group of friends, you can absolutely do that. You're probably not going to overtake Foursquare at all. But the point is, really, you can do it. There are even open source operating systems that can look and feel like any Windows or Apple box. I'm sure we all know of one or two. Or five. So these projects, their big selling point is that they have market share. So it may as well be you from the start that gets that market share. You know, go out there, release your thing, launch it, get people to love it, which of course is easier said than done. But you can get that market share from the start.
And that's really kind of your competitive advantage when it comes to open sourcing your software. So okay. Let's say I failed to convince you to open source your software at your company. It happens. You know, going back to PagerDuty, we would be stupid to release our ultra-reliable alerting pipeline. I mean, that's kind of a huge thing for us. We've invested thousands of hours of engineering time. Reliability is absolutely a competitive advantage. So we don't want to release that. But we do some things internally that are actually kind of cool. So one of those things is internal transparency. We maintain a wiki. And before coming to PagerDuty, I hated wikis. They're never up to date. They get kind of stale. They're spam magnets. Anybody maintain a wiki here? No? Okay. Yeah, count your blessings. But they tend to be kind of awful. But PagerDuty actually has one that's relatively good. And occasionally you'll find some crufty information on there. But for the most part, especially on the product and engineering side, everything's up to date. You can go in there. Like, I'm not an operations engineer, but I can go in there and see some of the remediation steps for when specific alerts come in. Things like that. It's really cool. We have this concept of demo days. So if you want to see what's going on with the product, you can come to a demo day. Everybody in the company is invited. They are recorded. There are notes that are drafted up. Things like that. A lot of the meetings are just open to everybody. So especially in like product and engineering and marketing and even sales. You know, if you see an invite on someone's calendar and you're like, you know what? That actually sounds pretty interesting. You can pretty much invite yourself to that, with very, very few exceptions. We also have a policy, at least on my team, of no meetings over 30 minutes. Anything over 30 minutes requires an agenda.
And even if it's under 30 minutes and doesn't have an agenda, you are allowed to decline it, no questions asked. Or you can hit someone up and be like, hey, I really need an agenda for this before we can actually talk about it. So we also do open training. So it doesn't matter where you sit in the company. We have scrum training. We have Kanban training. Things like that. You're allowed to come to that, regardless of whether or not you're working on product or engineering. If you just want to learn about Kanban, like have at it, we will teach you Kanban. We will get you a certified scrum master certificate if that's what you want. We're really big on that. And another cool thing is executives have office hours. So it's not necessarily a complete open door policy, but it kind of is. It's a scheduled open door policy. So anybody can come in and be like, hey, I want to sit down with the CEO and talk to him about our product roadmap. And you can do that. And he will be receptive to it. It's absolutely great. So I mentioned open information. This is actually my Kanban board at work. So I had to blur the things out because some of them contain customer information and the sort of thing that you don't really want to open source. I don't want to tell the world who our customers are. But this is public to the entire company. Anybody at the company can open this up and be like, oh, okay. Well, here's what Amanda's working on. But it's kind of cool. So you see the guy with the green avatar there? That is an intern on another team that answers to someone else entirely. He asked what I needed help with one time. And I was like, oh, you know, go look at these tickets and see what you can take care of. And so since I'm an evangelist, I send a lot of t-shirts and stickers out to people. It's just one of my routine day-to-day things. Reply to a lot of emails. So got him started working on some of those. And then there were a lot of like little code sample tasks that were left over. 
And he was like, you know, what would it take for me to be able to do these? And I was like, well, you know, here are some things that you can learn. He pretty much hit me up. He's like, okay, what specifically would I need to learn in order to be able to do these things? So we taught him JavaScript. And so this is a dude who didn't even know something like a terminal existed who is now shipping code. He's got some little code snippets running in production now, just because he saw a Kanban board at our company, which is really, really cool. So he's working towards becoming a junior dev, which is great. So through open information, he learned new things. And now he has some stuff in production, something he can put on his resume and take to another company, right? And this task that's over here in the done category off to the side was something that someone else decided to do over a weekend. It was just a little documentation fix. I didn't ask her to do it. She just up and decided, like, hey, I am blocked on this. I want to do it this weekend. And she just went ahead and did it. And I didn't have to do anything. It was great, because my board's open. Like, people can just go in there and grab tasks. So not to say that I want to encourage people to do my work for me, but it's kind of nice when people can collaborate on these things and pitch in and help, especially since I am currently the only evangelist at PagerDuty and kind of swamped, especially doing all the public speaking and things like that. So it's really nice when people can step in and be like, hey, I know you're out on the road, but I really need this documentation fixed right now, and then just go do it. So we're also big into cross-team functionality, which you just saw. That's a great example of cross-team functionality. These people, they're not even on my team, and they're contributing JavaScript and documentation fixes. That's awesome.
So anybody can contribute anything and learn the things that will allow them to contribute, via the training or outside study or whatever. Absolutely not saying that anybody can be CEO for a day. There are certain things where, you know, use some common sense here. Like, don't let your intern be the CEO for a week. Obviously, definitely not condoning that. You know, not saying that everybody should be allowed to go to the board meetings or anything like that. It is okay to have some need-to-know things. Even open source has some need-to-know things. Just like it's okay to not release your bread and butter. You can absolutely give back. If you're using open source software and not open sourcing your stuff, you should absolutely be giving back to the communities that are helping you out, especially if they're helping you make a lot of money. So the teams own the projects. They establish specific guidelines. I own the documentation at PagerDuty. I have a few simple rules, which are pretty much: no force pushes. I don't want to see any of that. They're actually prohibited through GitHub, which is awesome. No deploys until tests pass. Yes, we actually have tests for our documentation. If you're not doing that, it's fun; I'd love to talk to you about it. And everything must be edited and signed off on before going out. Little basic things. And people follow these rules all the time. I have had one force push, and it was because somebody had some sort of issue with things getting clobbered in their repo. And I was like, hey, don't do that anymore. And they were like, hey, yo, I'm so sorry. Like, they rolled everything back, fixed it. It was good to go. More often than not, people don't want to inconvenience you by making you follow their workflows. So the force push thing, not a huge ask from people. So I handle the bug queue, review the pull requests, merge things as needed. But many other people are involved in making that project run smoothly.
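Tests for documentation are easy to get started with. As a hedged illustration only (this is not PagerDuty's actual suite; the `docs/` directory name and the Markdown-only scope are assumptions), a check like this can fail a deploy when a doc links to a file that doesn't exist:

```python
import pathlib
import re

# Hypothetical docs tree; adjust to wherever your documentation lives.
DOCS = pathlib.Path("docs")

# Matches Markdown links like [text](target), capturing the target
# up to any #fragment.
LINK = re.compile(r"\[[^\]]*\]\(([^)#]+)\)")

def broken_links(root=DOCS):
    """Return (file, target) pairs for relative links that point nowhere."""
    bad = []
    for md in sorted(root.rglob("*.md")):
        for target in LINK.findall(md.read_text()):
            if target.startswith(("http://", "https://", "mailto:")):
                continue  # external links would need a network check
            if not (md.parent / target).exists():
                bad.append((str(md), target))
    return bad
```

Wired into CI as `assert not broken_links()`, a check like this blocks a deploy in the same spirit as the no-deploys-until-tests-pass rule above.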
And this is actually my favorite thing that we do. We are big on transparency when it comes to us screwing things up. So we have public blog posts about things that have gone sideways. But internally, we recognize everyone makes mistakes. This is one of the scariest parts of transparency is letting everybody see when you screw something up. It's scary. You think people are going to judge you. People are going to be angry. And yeah, you know what they might, but recognize that everyone makes mistakes and learn from them. As I was showing Captain Intern something, I took down our entire developer portal. And now he knows not to do that. So my screw up was a lesson for him and how not to deploy documentation. So you never know when things, when people are going to learn from something like that. We also have a concept of blameless post mortems. So no one is blamed. No one is attacked. No one's fired unless you're just blatantly reckless, which most people are not. Most people are reasonable. And really, no one should be fired for a bad deploy in this day and age. Like it happens. Hopefully you have some tests that are running. But sometimes, you know, just stuff happens. We all know that. These failures are just great opportunities for us to improve. And they're treated as such, you know. And like I said, we even released them to our customer base. Like, hey, you know, we ran into this weird thing with Cassandra. Like this stuff happened. It sucks. But here's what we did to triage the issue. Here's what we think happened. Here's who we're working with to make sure that this doesn't happen again. So it's absolutely great. And if you could take one thing away from this talk, it would be this. Really learn from your failure. Adopt a policy of blameless postmortems. Stuff's going to happen. Everybody can learn from that. So what are some other things that we can open source? For a company, we're a SaaS-based company, and we don't want to give away our entire SaaS product. 
There's all these other things that are kind of attached to a company that you can actually open source. So documentation, super important. We have API docs, tutorials, things like that, and we allow other people to contribute to them. We've had people add their own code samples to our documentation, which has been great. Things that we hadn't even heard of. I just found an Erlang library for PagerDuty, and I was like, hey, do you still maintain this? And he was like, yeah. And I was like, okay, I'll add it to the documentation. And he just went ahead and did it himself. It was awesome. I didn't have to do anything. He was just like, oh, your documentation's open? I'm going to add my own library to it. And that's great. Because it allows people to kind of dictate where the documentation goes. They know what they need. So why not let them help out with something like that? There's a whole bunch of libraries and tools. These are usually not secret sauce, so you can release them. Other people might find them useful. Things like, I wrote this really crappy Perl script like six years ago to help me manage my DNS zones. And some dude was like, hey, this is awesome. And he forked it and is now maintaining it himself. And I'm like, great, that's awesome. Go do that thing that you want to do. And people might find them useful. Even something like, we released an open source cron thing. Like, people have managed their crons forever and that's great, but they might find utility in our particular cron utility. And they have been. There's a bunch of stars on the repo and things like that. So you never know what sorts of things people are going to find useful. So if they're not an integral part of your bread and butter, I guess, just release them. You really don't have a whole lot to lose in cases like that. SDKs, also key. Again, the people who need this should be able to tell you what they need or modify it themselves.
It's great when you roll up to a new utility and you see, oh, they've got this API. Well, I'm a PHP developer. Do they have a PHP library? And then they do. And then it's a nice PHP library, because the people developing it are actually PHP developers, not like some Java dude who has lived in Java land who is trying to do the PHP and doesn't necessarily know the nuances of the language, or whatever your language of choice is, right? Frameworks, same sort of thing. Absolutely release your frameworks. I would sort of put OpenStack in that category, where it's part of their business, but it's not necessarily the most important facet of their business. So go ahead and release stuff like that. Same with modules. There are people that release all sorts of modules for all sorts of things. And it's one of those things. Again, you never know who's going to benefit from something like that. So we got started a little late. Thank you so much for your patience with that. Hopefully you've taken something awesome away from this talk. I would love to hear your thoughts on it. So please feel free to reach out to me. Changing things in two places. So thank you so much. Are there any questions? No questions? Okay, we'll start over here. Sure, so the question was, are there any sort of utilities out there to help you release certain portions of your project? And the answer to that is maybe. So there are different package repositories. There's RubyGems and things like that that you can release to, or you could get into Composer with PHP. So there's all sorts of things like that that would help you. The downside is that you have to kind of modularize your things in order to ship them to people. But yeah, definitely. Or really, how your app is architected is going to kind of determine how you do that. In the back. Have we found that it helps us recruit? Actually a little bit, yeah. We get people all the time. We actually have a core Chef contributor on staff.
And he was like, yeah, this is kind of cool. PagerDuty and Chef? Let's get a job there. So yeah, absolutely. Huge, huge recruitment tool. And I would encourage you, if you are interested in hiring developers, open source something and let them go nuts over it. There's another question right here. So he was saying, some parts of the code might be covered by patents. How would you deal with that? That's really going to depend on your company. I feel like that's getting into legal territory, and you probably want to talk to a lawyer about that. I would definitely encourage you to talk to a lawyer about that. But if it's your intellectual property and you've got a patent on it, patents are good for X amount of years. In theory, you should be able to release that and be covered by virtue of your patent. Because if somebody starts releasing stuff that violates your patent, you can go after them. And you see that all the time, people fighting over rounded corners and whatever inane crap, right? So yeah, you should be covered. Hopefully you will not need to sue anybody, but you might. I would definitely encourage you to talk to a lawyer about that. Another question? So he's saying he has an issue with submitting patches that just kind of go into a black hole, and how would I approach that? Getting permission to work on an open source project. So I have a policy of: I don't ask for permission. If it's open source, like, I'm going to submit a PR or I'm going to hit a mailing list with a patch or whatever. And yeah, sometimes they get ignored. I've definitely been ignored by projects. Something that has worked for me is I just keep submitting something until someone pays attention. I get at people on Twitter. I will highlight them on GitHub or whatever. Find the Slack channels or IRC channels that they're hanging out on and just kind of, I don't want to say harass, but aggressively point them in the direction of my patch or my contribution. And that typically works pretty well.
But yeah, sometimes it's unavoidable, especially if it's like a project that's not really maintained anymore. That's going to be a lot harder. And that's probably getting into fork territory at that point. Do you have like a specific example? Okay. Do you have a comment on that? Okay. Good to know. Yeah. So specifically, fighting your legal team to be able to contribute to open source. Honestly, in something like that, I mean, I can't very well say, oh, we'll just go find another job. That's not feasible. It's not realistic. I would just keep pushing. Eventually, they will either give you an answer, which is no, or get so tired of you poking them that they will let you do the thing that you want to do. I feel like I have spoken to people, particularly at larger corporations, where that's a thing, where the company has a policy that says any source code that you write at work, even if it's in your spare time, and touches anything related to work, they have ownership of it, and they just keep fighting. And eventually some people will get an explicit no, like, you know, you can't, we own everything, you can't do anything, or they'll give you permission in writing to do what you want to do. Yeah, no problem. Any other questions? Last chance. All right. Thank you so much. Welcome back, guys. How's SCALE going for you guys so far? In that post-lunch coma, is that why you're all quiet? Or do you just really not like our conference? How is it going today? Thank you. All right, so our next speaker is Richard Wareing from Facebook. He's going to be talking about scaling GlusterFS at Facebook, and without further ado, I'll give it over to Richard. Thanks, Phil. All right, first, I guess, quick show of hands. How many have heard of Facebook before this conference, or heard of Gluster before this conference? Okay, good. And how many have used it? How many currently use it? Okay, it's not bad. How about Ceph? Okay, Hadoop?
All right, cool, cool. Just wanted to see the audience, so I know where to not spend time or spend time. All right, so my name's Richard. I'm a production engineer here at Facebook. I've been at Facebook, I guess, almost five years, and kind of seen Gluster transition from what my original manager called the science experiment to now production use. So here's kind of the quick agenda of the talk. For those of you who have really not any experience with Gluster, I'll give you kind of like a three-slide crash course on what it is and how it works, and that'll hopefully give you some context for the rest of the presentation. And then I'll go into some deployment styles, how we have used Gluster at Facebook, how we currently use it, and how we deal with things like NFS. And then I'll go into kind of the two broad areas where we've had scaling challenges, one in operations, and the second one is the actual core code of Gluster, or the internals, and some of the changes we have had to make. And if we've got time, hopefully, we'll get into some questions. So first, I just want to acknowledge the great team I work with. These are the six folks that I work with, or five other folks I work with every day. Open source is fundamentally about being on a team, and kind of our team is no different. So I'm really representing a body of work done by these people today. So I just want to acknowledge that. So first, the hook. Here's some numbers on Gluster at Facebook. Data sets range from gigabytes to petabytes. Individual clusters can have on the order of hundreds of millions of files, and billions of files if you go across clusters. In terms of QPS, or what in Gluster we call FOPs, file operations, it's tens of billions per day. And namespace sizes, this would be like a single volume, can be anywhere from the terabytes into the petabyte range. And bricks, which is kind of the unit of storage in Gluster, we have thousands of them.
So this is kind of like what we have to deal with. And here's kind of a quick lineage of Gluster at Facebook. We started out at 3.3; in a prior life I had started with 3.2. And then in 2013 we moved to 3.4. And then last year we moved to 3.6. You'll notice we trail mainline. We do this for stability reasons. And I can kind of touch upon our development cycle a little bit later. But we typically will be like a version or two behind. And clients: one cluster will have tens of thousands of clients. And I'll get into how we actually accomplish that a little later. So first off, we'll go back even further, to the origin of Gluster itself. This is about open source, after all. It was created in 2007 by Gluster Inc. The founder and original author was this guy named AB. He's been really good to us and the whole community. And it was acquired by Red Hat in 2011. And they've been, I think, a really great steward for Gluster. In fact, I don't think I could have asked for a better acquirer of that company to kind of carry it along. And they put a lot of resources into Gluster now, which is really good to see. So, okay, as promised, here's like a high level of Gluster. There's really three basic ways to get data in and out of Gluster. You have FUSE mounts, or FUSE clients. You have GFAPI, which is pretty much what it sounds like: a direct API into the cluster. And then you have NFS. And on here you can see kind of the two important translators. So translators in Gluster are basically the modules inside the software stack that divide up the functionality inside Gluster. So if you're familiar with GNU Hurd, it's basically an idea stolen from that project. And Gluster uses it. Two important translators to really understand are the Distribute Translator, which is effectively doing sharding. If you're familiar with things like Memcache, it's the exact same idea. And the Replication Translator.
That's what provides the high availability and replication of the data. Here we're just showing a simple cluster of three shards, with three replicas inside each. We'll start with the Replication translator. Effectively, a client doing I/O updates what in Gluster we call a journal. What it really is is little entries in extended attributes on individual files: as you write to a file, it's effectively just counting how many write or metadata operations you're doing to that file. Should a node fail, Gluster uses something called the wise/fool algorithm to figure out where to go to reconstruct the data when that node comes back. It builds a little matrix (for a 3x-replicated cluster, a 3x3 matrix) and, based on that algorithm, figures out which nodes to actually heal from. In this case it's going to pick brick one. Okay, the Distribute translator. This is basically ring hashing, very similar to what Memcache does. In Gluster we implement it at the directory level, so if you ever wondered why Gluster has directories on all the nodes, this is the fundamental reason why. When you create a directory, Gluster creates a hash ring based on how many nodes are in your cluster and encodes it, again in extended attributes, across the cluster. These are what we call subvolumes in Gluster; you can think of them as shards. When we actually do file I/O, we take the file name, hash it, and figure out which subvolume it should go to. This all sounds really simple, but a lot of the, I think, 500,000 lines of code in Gluster is about the hard parts: what do you do when you rename things? What do you do when you're growing or shrinking a cluster? All of that gets really complicated.
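The Distribute translator's job lends itself to a short sketch. This is not Gluster's implementation (Gluster uses a Davies-Meyer hash and records per-directory layouts in extended attributes); it just illustrates the idea of splitting a 32-bit hash space into per-subvolume ranges and hashing file names into them:

```python
import hashlib

def make_layout(subvolumes):
    # Split the 32-bit hash space into contiguous ranges, one per
    # subvolume, standing in for the layout Gluster records in
    # extended attributes on each directory.
    n, span = len(subvolumes), 2**32 // len(subvolumes)
    layout = []
    for i, sv in enumerate(subvolumes):
        hi = 2**32 - 1 if i == n - 1 else (i + 1) * span - 1
        layout.append((i * span, hi, sv))
    return layout

def pick_subvolume(layout, filename):
    # Hash the file name into the ring and find its range. Gluster
    # uses a Davies-Meyer hash; md5 here is purely illustrative.
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % 2**32
    for lo, hi, sv in layout:
        if lo <= h <= hi:
            return sv

layout = make_layout(["subvol-0", "subvol-1", "subvol-2"])
print(pick_subvolume(layout, "report.csv"))
```

The important property is that the mapping is deterministic: every client hashing the same name against the same layout lands on the same subvolume, with no central lookup.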
So at Facebook, when I started about five years ago, there was a lot of proprietary storage used for POSIX. Truth be told, we still have some of it, because there are still some really hard use cases out there that we haven't been able to handle with Gluster, but we're still trying. One of the fundamental things that drove Gluster adoption was the speed we could move. Facebook culture is all about moving fast, and when something goes wrong and you have to explain to a customer that it's going to take potentially hours to diagnose, because you have to go talk to a vendor, or potentially days to actually come up with a fix, that's a non-starter, especially when you're working with other teams across the stack that are literally getting hot patches out within hours. So for the POSIX storage at Facebook, we wanted to move in that direction. The other thing that really drove adoption is the data center. Although our clusters did not look like this, this highlights the issue of cabling. There are two important cabling systems inside data centers: power and networking. On the networking side, most proprietary storage systems have really custom cabling. They're built from the ground up with dedicated back ends like InfiniBand or Fibre Channel. And because these systems are not ordered or turned up frequently, it's really hard for your setup folks to remember how to cable this stuff up, so you either have to bring in the vendor to actually do it, which takes time and coordination. The other component is power. At our data centers we're using 277-volt power now, which is more efficient.
And a lot of the vendors are simply not on board with that, because frankly a lot of other folks out there are still using older-style power systems. With OCP we're trying to push more adoption of this, but really, the storage world is probably going to be on the trailing end of it. The final thing is money: cost per gig. This is how everyone always thinks about storage. The accountants get out their spreadsheets and try to figure out how much stuff costs, and POSIX storage systems are not necessarily the cheapest thing in the world. POSIX in general is actually pretty hard to solve well at large scale, so you pay for it. The ability to drive the cost down by using commodity hardware is obviously really appealing. So, our customers. These include new customers as well as the existing customers we had when we originally started, and we have a wide range. Many teams are simply doing R&D. These might be teams working on AI, or a team that just has an idea; they don't really know what they're going to do with it yet, so they don't want to invest the time in, say, writing to an object store API. They just want to throw some ideas around, write some C++, and figure out: hey, is this actually going to do what we think it's going to do? Then they might stay with Gluster, or they might move on to something like HDFS. Those that stay go into basically full production workloads, and they're supported as such.
We also have the classic POSIX general-purpose clusters. These look like maybe your average NetApp: a slew of unstructured data from various teams. It could be spreadsheets, it could be media; it runs the whole gamut. You can further refine this into four different groups. You have archival, which is your classic backups. Then there's being the glue between large-scale systems, which is another important use case. Say you've got some huge data warehouse application that's going to distill many petabytes down to maybe only a few terabytes, which is then going to be injected into some database. Generally these systems don't talk well to each other, and we use Gluster to be the glue between them. And finally, anything that doesn't fit into any other storage solution. If you're not media, you may not fit into something like Haystack. If you don't look really database-y, or maybe you were on a database but the database folks yelled at you, like, "what the hell are you doing storing a gigabyte blob in my database?", you might get booted out and told to go find another home. Basically, if you don't fit into any of these other boxes, we're usually going to take you on. Here's another way of looking at it: by I/O size and data set size. Haystack, HDFS, and cold storage are at one end of the spectrum, and you've got MySQL and RocksDB at the other side. Those are typically very small I/Os per transaction, and the data set sizes are generally gigabytes to terabytes; using some sharding magic, they can obviously also get into the petabyte range, in the case of MySQL.
And then you've got us in the middle: data set sizes from gigabytes all the way into the petabyte range, and I/O sizes anywhere from 4K to 2 megabytes, depending on what request sizes clients are using over NFS. Hardware. You'll probably have no surprise here: we're using Open Vault, which is an OCP solution. That's what one of the Gluster trays looks like. Currently we're using 4-terabyte drives, 30 of them in a machine, which works out to about 100 terabytes usable per host. We divide those into two RAID 6 groups. We also have some other hardware that we use, which are hybrid systems: a mix of flash and a RAID 10 volume. Those are for hybrid workloads with a lot of hot data that needs to be accessed very quickly. The Gluster community is working on other solutions to this around cache tiering; I think you'll see it showing up in 3.7 already, and I think it's going to get more hardened in 3.8. But we like to experiment both ways; right now these hybrid systems do block-level caching, whereas cache tiering will be more like file-level caching. These are nearline SAS; basically the controller is enterprise-grade and the platters are more consumer-grade. Yeah, RAID 6, 15 and 15. For the underlying file system we primarily use XFS. We've started to use Btrfs as well, maybe 20 percent of the fleet is on Btrfs, but we're letting that mature a little; we've seen some performance issues with it, so the majority is still XFS. No hardware RAID? We are experimenting with software RAID right now. The big questions there are the write hole, and being able to journal things very quickly.
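The capacity math for those trays works out roughly like this. A back-of-the-envelope sketch; "usable" here ignores file system and formatting overhead:

```python
drives_per_host = 30
drive_tb = 4
raid6_groups = 2

drives_per_group = drives_per_host // raid6_groups  # 15 drives per RAID 6 group
data_drives = drives_per_group - 2                  # RAID 6 spends 2 drives on parity
usable_tb = raid6_groups * data_drives * drive_tb

print(usable_tb)  # 104, i.e. roughly the "100 terabytes usable" quoted
```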
NVRAM from a hardware RAID card is obviously really nice, and it's actually a really tricky problem to get rid of it. In terms of the vendors we use, we do multiple vendors: I think LSI, and PMC-Sierra, which I think is now Adaptec. Yeah, we use all the standard tools. If you want to look at what a Gluster cluster looks like at Facebook, this is the general-purpose cluster we use. We basically have subvolumes that go across racks. Here the positions are shown as perfect, but in a real data center we don't really care what position a box is in within the rack. With OCP hardware it's nine servers per rack, so we have nine subvolumes per rack, and then we just stamp these things out. In terms of high-availability NFS, a lot of people always ask us how we do it. There are a lot of different options for this, and I really just suggest people use what they're comfortable with. We have used CTDB to do this; it's a really small piece of software from the Samba community, and its job is really just to move IPs around when a host fails. Here's a quick rendition of how it works: traffic for some file is going to some node, that node dies, and CTDB is responsible for moving the IP addresses over to the other nodes. This all works because NFSv3 is stateless. If you do not believe me, try it out with Gluster; it will work. You get a brief pause and I/O resumes. If you really want to drill down the stack as to why it works, it comes down to the structure of the file handles. In Gluster they're completely deterministic: you have a volume ID as well as a GFID encoded into the file handle, and because of that, any node can answer requests for any other node. This is one of the beautiful things they've done in the design of Gluster.
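The deterministic-handle idea can be sketched like this. The field layout below is made up for illustration (Gluster's actual encoding differs), but it captures why failover works: the handle is a pure function of the volume ID and the GFID, so any node can decode it without any shared state:

```python
import uuid

def nfs3_file_handle(volume_id: uuid.UUID, gfid: uuid.UUID) -> bytes:
    # An NFSv3 file handle is an opaque blob to the client; here we
    # simply concatenate the two 16-byte UUIDs.
    return volume_id.bytes + gfid.bytes

def resolve(handle: bytes):
    # Any server can decode the handle; nothing in it is tied to the
    # node that originally issued it.
    return uuid.UUID(bytes=handle[:16]), uuid.UUID(bytes=handle[16:32])

vol = uuid.UUID("11111111-2222-3333-4444-555555555555")
gfid = uuid.uuid4()
handle = nfs3_file_handle(vol, gfid)
print(resolve(handle) == (vol, gfid))  # True, on whichever node decodes it
```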
These are the statements in the NFS spec about structuring the file handle this way. So this all works, but if you're looking at it with a critical eye, what's a problem with this method of HA NFS? Anyone got any ideas? Yes? We don't support that, so not a problem for us. In what way? No, not really too much of a problem. No, because again, Gluster has internal locking on the back end of the bricks, which maintains consistency during these failover events. Well, the real problem here is that this all works great within a rack, but what happens if a whole rack fails? It's a rare event, but for most people using a file system or designing an application on a stack like this, it's an event they probably want to know the outcome of. For a while we didn't really have an answer to this, so we started looking for other options, and this is basically what we came up with. It's basically stolen from how a lot of other systems load balance. This is a very classic setup for the web folks at Facebook, and I'm sure for a lot of web server systems out there: you have a bunch of machines, they all advertise an address over BGP, and there's a load balancer that directs traffic to those machines based on some sort of heartbeat to detect whether the systems are up. For us, we needed some tweaks to how the load balancing worked for Gluster, because we needed very static assignments for host, port, and source port, meaning that once a session was established, we wanted the traffic to keep going to the same NFS daemon; although failover is supported, you don't want to be failing over every single packet. Effectively, how this works is: a node dies, it stops advertising, another rack of nodes picks up that traffic, and the I/O resumes.
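The "static assignment" tweak can be pictured as hashing on the client's source address and port, so an established NFS session keeps landing on the same daemon. A toy sketch only: the real balancer is not described here, and a plain modulo remaps sessions when the backend set changes, which a consistent-hashing scheme would avoid:

```python
import hashlib

def pick_backend(backends, client_ip, client_port):
    # Hash the flow identity rather than round-robin, so the same
    # client session always maps to the same NFS daemon.
    key = f"{client_ip}:{client_port}".encode()
    digest = int(hashlib.sha1(key).hexdigest(), 16)
    return backends[digest % len(backends)]

backends = ["nfsd-rack1", "nfsd-rack2", "nfsd-rack3"]
a = pick_backend(backends, "10.0.0.7", 850)
b = pick_backend(backends, "10.0.0.7", 850)
print(a == b)  # True: the session is sticky
```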
Okay, so that's an idea of the deployment styles and how we do things like HA NFS. Now I'll get into some of the scaling challenges we've had. In a prior life, this is really what I had to deal with, and you can get away with a lot when you're dealing with one rack, maybe two or three racks. My mentality was quite different in terms of what I should be doing with automation and things like that, because if a brick died or an NFS daemon died, it was a pretty rare event; you could maybe put a cron job in there to clean things up every once in a while. Then my boss took me on a data center tour, and this is what I saw, and I suddenly realized that none of that was going to work. After you calm down for a bit, you start to think: okay, you've got to really change the way you're thinking. As I mentioned, there are two broad challenges: scaling operationally, and handling any deficiencies inside Gluster itself, the internals.
So operationally, the first thing you'll figure out, or should figure out, with Gluster is: okay, am I going to build one giant cluster and shove all my data in there, or am I going to do something else and make smaller clusters? They both have pros and cons. You can make one big monolithic thing; a lot of the Hadoop folks do this, and they've designed a stack that can literally do hundreds of petabytes, which is a pretty amazing accomplishment. But Gluster fundamentally is not built or designed that way. So instead of working against it and trying to force it to do things it's not really designed to do, we did what came naturally for Gluster, which is a celled approach. The cons are, of course, that you have more widgets, so you have to be really good at things like configuration management and provisioning; these cells really need to run themselves, because you're probably going to have a lot of them. There are some pros too: you get great isolation in terms of failure domains. If a NameNode dies in the Hadoop world, it's a complete tragedy, and hopefully your failover mechanisms work, but if they don't, you're going to have huge amounts of data unavailable. With a cell design like this, at worst you might have a few petabytes unavailable, but the rest should be A-okay. In order to manage all these cells, the first thing we did was make the cell, or the cluster if you will, really manage itself. We built a tool called Antfarm to do this, which is really a cluster manager. If you're familiar with Heketi in the Gluster community, that's really
designed to take over that role. Antfarm is something we chose to write in Python, completely external to Gluster, for some pretty important reasons. Some teams take the approach of actually forking a project and making a lot of internal modifications, even modifying things like the logging structures to pump data through, say, a Facebook logging library instead of the standard C logging. There are some advantages to that, but ultimately you're forking: you're marrying yourself to things the open source community has no idea exist, and they're certainly not going to support them. Being a big open source fanatic, that's not a direction I wanted to go. Anything specific to Facebook I wanted completely external; the core of Gluster had to stay pure, fundamentally still the open source product. Antfarm was really designed to take the Facebook-specific functions and encapsulate them somewhere. Fundamentally, Antfarm does performance metrics, configuration management, and monitoring and alarming. There are two components: you've got a manager node, one per cluster, elected with a bully algorithm (super simple), and then everyone who's not a manager is a worker. There are master tasks that the manager does and slave tasks that the workers do, and who does what is basically: if it needs to be coordinated, it's pretty obvious the manager should be doing it; if it's uncoordinated, the workers can do it. Coordinated activities would be things like turning up a cluster, or, if a node has been out too long and maybe a human didn't go figure out what's going on, or our site ops took too long, the manager can coordinate
that too. Uncoordinated activities: a node comes back from imaging, needs to put itself back into the cluster, and announces to the cluster, "hey, I'm back, I'm ready." That does not need to be orchestrated by a manager; a worker can just do it, as can submitting statistics, which obviously doesn't need coordination and can just be sent by each worker. One of the other important things Antfarm does is enforce layouts. We support three different types of layouts. We have what we call "off-network," which is probably the more common one in the Gluster community itself: no replicas are ever in the same network, or in our case, rack. The pros for this are high resiliency and high read rates; the con is less write throughput. But say a customer comes along and says, "I really need high write throughput." You can give them the "in-network" layout, which is basically putting replica groups always in the same rack. You first tell them they're insane and that they're probably going to have availability and durability issues; if they agree, this is what they get: great write throughput, but not as good read throughput. And you may have an engineer come to you and say, "hey, I know what I'm doing, let me set up my cluster however I want." For that we have "ordered." I'm not really aware of anyone that uses it anymore, but we support it, because hey, it's America. All right. So we are growing and growing, things are going pretty well, and we're getting more and more clusters, more and more cells. We finally get to a point where we're like, okay, holy cow, this is actually becoming a lot of work: turning these things up, managing provisioning, and when imaging fails, a human has to go figure out what happened and try to get it going again. We use things like Kickstart to do imaging, which is pretty automated, but when you have a lot of this going on, it
eventually doesn't scale. So JD was created, and it's designed to be, as Antfarm is to hosts, what JD is to clusters. It does things like provisioning: it shepherds machines through the provisioning process, creates the initial cell configs, and hands things off to Antfarm to actually go create the clusters. Eventually we're going to have JD monitor metrics, and the vision is that we don't want humans turning up cells at all. We just want humans feeding the monster with machines, and it will turn up cells on its own by monitoring metrics. If a cell gets, say, 70 percent full, boom, go create a new cell; we don't need to know about it. Our capacity engineering folks may not like this plan, but they don't need to know. Some people have actually said that, but I don't know; you'll see when I get to talking about Haley, then you might start to get scared, but that's coming. Okay, so there were some code changes for operational reasons as well. The first obvious one for us: we're a big IPv6 shop, and Gluster did not do IPv6, so we added that. This is one of the few patches we have not open sourced, because frankly, probably not all of the world needs IPv6 support the way we did it: it makes all of Gluster IPv6, and we actually removed the IPv4 support. As for Antfarm, we were looking at open sourcing it, but then Heketi came along, and we feel Heketi is really a better approach for the community. The question for us really became: do we move to Heketi, or do we continue on with Antfarm? We actually think it was a great approach the community took. This next one we have given to the community; I'm not sure if you'll see it in 3.8 or 4, but I think the Red Hat performance engineers actually really like
this feature. You used to have to actually run this "start profiling" command, then stop it, and it would dump out some stats for you. For us, Facebook is crazy data-driven; even engineers who don't own the Gluster clusters want to see metrics when something goes wrong. They want to know why. This is really ingrained in our culture, and for a long time they were like, "this thing is a black box, I can't tell what's going on." It really frustrated engineers. So we modified the io-stats translator in Gluster so it could run full time. We got rid of all the locking in that translator and got it to dump things out in a JSON format, which is digestible by almost any kind of monitoring system you can think of. From this you get something like three thousand different metrics every five seconds, which is more than you probably even know what to do with. We didn't stop there; we went for FOP sampling. Again, data-driven people want to know: what are my worst-case service times? To get that, you really need sampling. In a file system you're doing billions of operations, or hundreds of thousands per second in the case of Gluster; you can't record every one, so you need to sample. We built this FOP sampling feature into the io-stats translator. It's been open sourced as well; we gave it to Red Hat. Another challenge we had, which is kind of operations but kind of not, is this notion people have around NFS. I was naive to this before I came to Silicon Valley. I don't know if it's a Silicon Valley thing or an everywhere thing, but there's this notion that NFS is evil, it sucks, it's horrible, and if you use it, disaster will be upon you and your family. It's kind of an odd thing, because when you really get down to it and you
unwrap what people are really pissed about: NFS itself is just a set of RPC calls, and it's actually pretty nice. It's stateless, it's really clean, it's well documented, and it's really old. Frankly, I challenge people: whatever you plan on replacing it with, NFS will probably still be here when no one uses what you built. So why do people hate it? Mounts. Mounts were originally designed for local use, and the standards for mounts are such that people expect local-like behavior. When things on the network go wrong, or things over the network go away, mounts are not good at communicating to users what happened, and hard mounts make it even worse, because you don't even get any kind of error. It's also bad for other reasons: if there's a kernel bug, you have to upgrade your whole kernel. If you have a thousand machines and you go to a customer and say, "yeah, no problem, we've got a fix for that, just upgrade and roll a thousand of your machines," they're probably not going to do that. Looking for a solution to this, I looked to the open source community, and sure enough, this guy Ronnie Sahlberg wrote a thing called libnfs. This really allowed us to prove to people that it's not NFS you really hate. We did that by making CLI utilities that expose NFS as a CLI. If you want to get data in and out, you can cat it or put it; you don't need a mount anymore. It made NFS look and feel like things people really like to use, like the Hadoop CLI. And once we finished writing all this stuff, it provided demo code: if you want to actually embed libnfs in your app, you can. In short, it gave people an option beyond mounts, and I think that
was, just giving people a choice, what really brought down a lot of the tension. Here's a quick demo of what these utilities look like in action. You've got nfs-ls at the top, just showing there's nothing there; we echo some data into a file; then we list the file, cat the file, delete the file, and ls again to show it's not there. That's basically what it allows you to do: no mounting, completely userland. If there's a bug, you can upgrade it in user space, which is really nice. This one we've been meaning to open source; it's really on me, I have to get it working with autotools and such. What we'll probably do is offer it to the libnfs folks, so hopefully it becomes part of libnfs itself, and if you compile it, you'll just get these for free. Okay, back to internals and scaling challenges. We went to the first Gluster developer summit last year, which was really awesome (look for this year's), and one of the things we brought to the developers was pragmatism over correctness. That's the philosophy we have as PEs at Facebook, and an example I can give you is this code snippet. Any ideas why this could be a really horrible thing to do? I'll give you a hint. What this is really doing is: when you connect to an NFS daemon, this code is having it pick a privileged port, and this is actually in the FUSE code, so the FUSE clients on our stack originally used to do this. As our customers were growing and growing, we found that mounting was getting slower and slower and slower, and we were trying to figure out why. So we dug and we dug and we dug, and we found this beautiful piece of code. And since the days of the Raspberry Pi being
here, where ten dollars buys you a machine that can bind ports 1024 and below, I have no idea why people even bother putting this stuff in their code. Although it's correct (if you look at the NFS spec, this is what you're supposed to do), it's kind of insane. As developers, I think people need to at least put in an option saying: yes, I've satisfied the correctness of the spec, but I'm going to give you a way to turn it off for performance reasons. Another example we found was actually DNS lookups. We found that DNS lookups were happening for every inbound connection. Again, this is correct, and there may be security reasons you want it, but doing it inline is not very scalable. We've got little pieces of C code that prove you can do thousands of TCP connections in the blink of an eye, so you can scale to really huge numbers, even serially, if you avoid some of these issues. So, anyone who's set up Gluster before, have you ever seen this? This is an I/O error; it's probably one of the first things people see and hate. If you go through the Gluster docs, they will quickly tell you that the way you solve it is to go to the back end, figure out which file you really want, and pick that one, because this is basically a split brain: you've got two or more copies of the file and Gluster doesn't know what to do. We saw this pretty soon, before we even got that many hosts going, probably when we were in the hundreds of bricks, and we knew it was something we needed to solve. As it turns out, when you actually go ask a customer, "well, which one should we pick?", they'll either say "I don't know," "I don't care," or "pick the last one." Pretty much what a human's going to do, right?
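The resolution policies that come up here (pick by size, by time, or by majority) can be sketched as a small policy function. This is an illustration of the idea only, not Gluster's actual AFR favorite-child code, and the record fields are hypothetical:

```python
from collections import Counter

def pick_favorite_child(copies, policy="majority"):
    """copies: one dict per replica, e.g.
    {"brick": "b0", "size": 10, "mtime": 100, "checksum": "x"}."""
    if policy == "size":                       # biggest copy wins
        return max(copies, key=lambda c: c["size"])
    if policy == "mtime":                      # newest copy wins
        return max(copies, key=lambda c: c["mtime"])
    # majority: the content most replicas agree on wins
    winner, _ = Counter(c["checksum"] for c in copies).most_common(1)[0]
    return next(c for c in copies if c["checksum"] == winner)

copies = [
    {"brick": "b0", "size": 10, "mtime": 100, "checksum": "x"},
    {"brick": "b1", "size": 12, "mtime": 90,  "checksum": "y"},
    {"brick": "b2", "size": 10, "mtime": 100, "checksum": "x"},
]
print(pick_favorite_child(copies)["brick"])  # b0: two of three replicas agree
```

The point of encoding the policy is that the heal can run automatically, returning data to the client instead of an I/O error, while still logging that a split brain was resolved.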
They'll pick by size, they'll pick by time, or they'll pick by majority. And we basically modified AFR to do just that. You can see here it's called favorite-child; there's a split brain on the back end, and we resolve it automatically without any kind of I/O error reaching the user. We do log these things, because we want to know when they happen, but the key is that we want to provide availability for the data. And lately, to go to the next level on this, we just finished a patch at the DHT level to handle really exotic cases, where shards may disagree with each other about what the hash ring looks like, or what the GFID is on the hash ring, and we can resolve those cases without any kind of data loss. Yeah, in this case we're picking by size. Typically what we do is use the majority policy, so we'll pick the majority case and go with that. Since we implemented this, we've really never had a customer come to us and say, "hey, you bastards, you lost my data." It's been actually working pretty well. The next thing we had issues with was access control. If you go look at vanilla Gluster (Ganesha is now an option, I'll throw that out there), this is what you had to live with: an allow/deny system where you give it a list of IPs, you can even use wildcards, and it yays or nays access into an NFS daemon. When you're faced with tens of thousands of clients, you can clearly see this doesn't work so well. So we hired an intern and we said, "hey, intern, solve this problem." It ended up being a really well-compartmentalized problem for a summer, and it's kind of an already-solved problem out there in industry, with a lot of well-defined ways of solving it. The mechanism we chose was net
groups to do this and it was something that we actually had used you know an enterprise system so we knew it worked and we had a lot of infrastructure in order to to kind of like generate net group files and effectively how it works is we take you define an export file like this and we have a job in the background that basically just scans these these exports and will create net group files against a really huge database that knows in say a tier called you know a host scheme called my tier for example it'll figure out what all my tier hosts are and it'll generate a net group this is then sent to the to the machines using chef actually and from there this actually like scales and you can actually like control access on thousands and thousands of machines which are in turn limiting access to you know hundreds of thousands of machines okay so another internal thing this is probably one of the scariest charts a PE at Facebook can be faced with you I would wish this was like you know dollars in my bank account on the side but it is not it is memory and this is like this is a memory leak and you know probably last year and the year before you know kind of a lot of the low-hanging fruit problems were being solved and we were faced with stuff like this and you know at the beginning we really didn't know why this stuff was going on you would find this you'd find maybe a a brick that had high CPU high memory you might re-kick it maybe do a state dump beforehand and it would drop down and then you know it may stay down or it may go up but this was really bad and you know machines running out of memory this is even worse than like a machine dying because it's kind of like hobbling and in distributed systems like a zombie machine is like way worse than like a dead machine because you don't know like should I boot this thing out should I keep it in these are like really hard things to automate so ideally you want to like make code changes so they really just can't happen and in 
this case, it eventually got tracked down to locking: there was some misbehaving client, or some piece of Gluster out there, holding a lock on a file or directory and not giving it up. These are really hard problems to debug, and because they're hard to debug, it was really hard to write patches for them. We would see this on a system, go look at it, take a state dump, see tens of thousands or hundreds of thousands of locks pending, and then try to piece together what series of events actually made it happen. What broke the logjam was a feature we created called monkey unlocking. It's a developer feature: we deliberately drop 1% of unlock requests, which turns these really rare cases into really common ones. The idea was that Gluster must handle running with monkey unlocking on, and must not block while it's operating. Once we had that, it became pretty straightforward to write the patches, make sure they worked, and ensure that stuff like the graph back there never happens. So we created lock revocation. The idea behind lock revocation is: if no one contends for your lock, you're okay, we won't go after you, everything's good. But if anyone is contending on your lock, we're going to revoke it, based on two parameters. One is time; the other is how many people are blocked behind you. You can use one, the other, or both; it's really up to you. And when we revoke a lock, we're going to send you back EAGAIN. If you choose to ignore EAGAIN in your code, you may crash, and that's your problem, not ours. We clearly state in the docs that this can happen, why it can happen, and what the options are. In practice we don't see too many people crashing on EAGAIN.
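The revocation rule just described (never touch an uncontended lock; revoke a contended one by time held or by waiters queued, or both) boils down to a small predicate. This is an illustration of the idea only; the function and parameter names are invented for this sketch and are not Gluster's actual locks-translator options:

```python
def should_revoke(lock_age_secs, blocked_waiters,
                  max_age_secs=None, max_blocked=None):
    """Decide whether to revoke a held lock.

    A lock is only ever revoked when someone is contending for it;
    an uncontended holder is left alone.  Either threshold may be
    enabled on its own, or both together (None disables a threshold).
    """
    if blocked_waiters == 0:  # nobody contending: never revoke
        return False
    if max_age_secs is not None and lock_age_secs > max_age_secs:
        return True
    if max_blocked is not None and blocked_waiters > max_blocked:
        return True
    return False

# Uncontended lock held for an hour: left alone.
print(should_revoke(3600, 0, max_age_secs=60))    # False
# Contended lock held past the time threshold: revoked, holder sees EAGAIN.
print(should_revoke(120, 1, max_age_secs=60))     # True
# Contended lock with too many waiters queued behind it: revoked.
print(should_revoke(5, 500, max_blocked=100))     # True
```

In the real system the revoked holder gets EAGAIN back on its next operation, which is exactly the documented failure mode the talk warns client code must handle.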
We are blessed with pretty good coders, so they're handling it pretty well. All right, the final internal change I'll talk about is replication. Gluster has actually gone through two generations of replication, and we don't use either of them, because one of the things you find when you're trying to scale to really large numbers is that you need simplicity: the fewer moving parts, the better. This was really a case of that. We created this thing called halo replication, which is really a collection of patches that together enable this form of geo-replication. First patch: we multi-threaded the heal daemon. We've given this upstream, and it could be useful for other people who just want to make things heal faster, depending on their hardware; most people may want to heal slower, not faster, but we have pretty beefy hardware, so faster is better. This matters when you're geo-replicating, obviously, because when you're dealing with high latencies you want as many packets in the air at the same time as you can get. The other thing, when people come to me and say, "Okay, Rich, this is really great," the first question they have is... actually, I'll get to that in a second. Non-destructive GFID split-brain resolution: this will mean nothing until probably the next slide, but it's really important. Basically, what do you do if you have two copies of the same file with different GFIDs? (If you're not familiar with Gluster, a GFID is basically like an inode number on a file system.) We created a patch that uses techniques similar to the split-brain resolution I discussed to resolve these cases, but non-destructively: we don't delete anything, we rename the losing data. And finally, the halo feature itself. At its core it requires only one option; it'll figure out for you how many data centers you have, take its best guess, and form replication zones, and within those zones you will have
synchronous I/O for reads and writes. If you're a geek like me, you may want to tune this, and you can tune things like the minimum number of replicas you want before you acknowledge a write, as well as the maximum number of replicas, because maybe within a zone you've got six replicas (you're Netflix and you need a lot of read capacity), so you may limit how many replicas you write to synchronously. Then you can decide whether you want failover enabled, which it is by default. That means if some failure happens, you can decide that even though there are only two replicas left in your zone, you want to bring in another replica from another zone, at the cost of higher-latency reads and writes, because you want that extra durability. We also have this notion of min-samples. At its heart, the system looks at pings, and that's how it figures out where all the data centers are; min-samples is how many pings it has to see before it starts making calls, and the system stays in a synchronous state until that many samples have been received. So this is the way you can think of halo geo-replication. We've got three data centers here: maybe this is the west coast and these are the east coast; the east-coast ones are maybe 12 milliseconds apart, and then it's 65 or 70 milliseconds to the west coast. Halo will form replication zones based on whatever the halo setting is, in this case about 10 milliseconds: anything within 10 milliseconds is in the zone. And this is at the level of the brick nodes themselves, so if you're an NFS daemon trying to figure out who you should be talking to synchronously, you do it based on the halo value. It says, okay, I can see maybe 20 bricks in my zone, and they're within 10 milliseconds of me; that becomes who I'm going
to talk to synchronously, and everything else I'll just let the heal daemons handle asynchronously. But maybe you're a weird customer and that's not good enough. We do have some folks like this, who say one data center is not enough: "I need two data centers. A comet hits, takes one out, I need my data safe. But I also want a third copy over there for some other reasons." You can actually do this using a FUSE mount: you just say you want 20 or 30 milliseconds for your halo, and you'll get two data centers synchronously. So fundamentally it's extremely flexible. And then of course we have the heal daemons; they're the ones actually pushing data between the data centers, and they just use an infinite halo, so they see everything, talk to everything, and can shuttle data around. Some of the reasons we like halo, and I think our customers do too: it's super easy to use, and all the standard tools work; you can use a manifest, you can use GFAPI, the NFS CLIs, and the FUSE mounts. It's got some cool behaviors, like partition tolerance: if two regions are up but disconnected, they can both receive writes. How we do this is actually the GFID unsplit logic I mentioned before: it lets you write to two regions simultaneously, and we can handle figuring out who wins. And when we do figure out who wins, we're not going to blow away the loser; we just rename it out of the way. We do it pretty much like a local file system would, which is last writer wins. And it's pretty performant: with six hosts we can get up to a gigabyte a second, or 450,000 files an hour if you're more interested in files per hour, and this scales linearly, so if you add 12, 18, 24 hosts, whatever, you'll get it. So, future work and current challenges. Hardware RAID: this is something we are looking at
removing from our stack. We don't really know how we're going to do it yet; it's actually a really hard problem to get rid of the NVRAM. My first inkling is that this may be one of those things we can solve if we get erasure coding into production and really harden it. Datalab in Spain did some awesome work creating the disperse translator; what we're going to do is try to get that into production in our systems and really, really harden it so it works at our scale, and we're hoping it may be the key to JBOD as well. We don't know yet. The other thing is multi-tenancy. We're basically making all of our clusters look the same and act the same: if you become one of our customers, you're going to get a standardized cell, and you may or may not live with other customers. That's another pretty hard problem, because we have to do QoS and provide things like self-service. We've got QoS down pretty well; we actually have a patch, which we'll probably be upstreaming this quarter, that does throttling at any directory level, and that's been the key for us to get multi-tenancy to work. That's it. [Question] Yes, so that's cross-country, and they're both writing the same file. If they're both writing the same file, and you're doing async reads, they'll both be reading the data they see within their own region. Depending on the customer, some people have really, really high consistency requirements, and for those we basically have to tell them: listen, you can't have it all. If you want 100% consistency, we can give that to you, but you're going to be in sync mode: you'll have the geo-replication, but we're really going to have to consult all regions in order to give you that perfect answer, in terms of consistency. [Question] So, Gluster does granular locking, so as it's
replicating that file... well, it depends. The very common case, say it's a brand-new file, is way easier: a file might be being written on the west coast and read on the east coast. The first thing Gluster will do is create a file of whatever size it sees at that moment on the other end and begin backfilling the data. Gluster then has granular locks that it enforces on readers in that region, and they will not be able to read past the lock, so they should actually get pretty consistent reads in that case. The more exotic case would be the random-write case. That's something we just tell our customers: don't do this, we don't really support it. The granular locking in theory should protect you, meaning that while the heal daemons are replicating data into that regional file it's going to be locked and you won't be able to read it, but this is not something you'd want to run, say, a database on. We'd tell them: use a database; database systems have their own replication mechanisms, because their requirements are very, very specific. This is more for things like photos or videos, things you open, write, and want replicated. [Question] Go ahead. Not open source yet. We had it pretty down pat on 3.4; on 3.6 we're almost there, and I think 3.6 is really what most of the community is going to want to run it on. There are a few final things we're touching up, and we don't like throwing patches over the fence that aren't hardened. Since we're the ones writing the patches, we can roll a bit faster and looser than an end user, so we don't want to put patches out there that aren't baked. It'll probably happen before the Gluster Developer Summit; we'll get it out there. [Question] Yeah, so internal customers: WhatsApp might be one of them, Instagram might be one of them; these are all customers to us. [Question] Yeah, so photo and video, to be clear, that's Everstore, Haystack; that's their bread and butter. We may store bits and pieces of that, but usually not for front-end access; it's usually people doing experiments, trying out different video codecs, those kinds of things. If someone comes to us and says, "I want to store photo and video," boom, we shoot them over to the team best suited for that. Go ahead. [Question] It varies by cluster. The number I do have off the top of my head is that probably 80% of the FOPs on a Gluster cluster in our systems are actually metadata, and only 20% are actually reads and writes; on some clusters it gets as high as 90%. In terms of reads versus writes, the last time I broke down that stat it was almost 50/50, which kind of surprised me, because on NFS-style systems there are usually a lot more reads than writes, but our users are pretty write-heavy. I think last time it was about 50/50. [Question] The Gluster team at Facebook? Yeah, just those six guys that were at the front of the presentation. [Question] Yep, so currently we depend on our hardware RAID cards to do that at the block level; the background scan is all done by the hardware RAID cards at the block level. Gluster actually has bitrot detection coming; we don't use that yet. What we'll probably do is roll that into our JBOD project, because without hardware RAID that's something you have to own and do yourself; right now we can pretty much delegate it to hardware RAID and hope it does its job. [Question] Yeah, since we run about 20% on Btrfs, it's been a good gut check to see whether hardware RAID is lying to us,
and the answer is: not really. It's actually fairly rare. We have run into a few cases where Btrfs pointed out corruptions that the hardware RAID did not catch, but they're rare enough that we weren't concerned, and we've never seen it on three replicas. In those cases you usually just drop the bad data on that replica and reconstruct. [Question] No, not yet. We actually did our own snapshot work using Btrfs. We've done some experiments with LVM; talking to the Gluster guys at the last summit, they seem pretty confident in LVM snapshots, and it's file-system agnostic, so it's got some nice properties there, but in our early experiments we're still not super pumped about its performance. Our customers really don't like blocking; if anything blocks for almost any reason, we hear about it. [Question, from the back] So, it can be a hundred percent or it can be ten percent; it's really up to the operator of the cluster. One of the cool things Gluster designed in, I think pretty early on, is multiple queues for different operations: you've got a high-priority queue, a normal-priority queue, a low-priority queue, and a least-priority queue, and you can run the heal daemons in the least-priority queue, which keeps them from suffocating your production workload. Typically what we'll do is assign threads to queues based on how many cores our systems have, and then we'd typically run two or three threads on the least-priority queue. Not all of our clusters are running that today, because we only really got comfortable with the notion of least-priority queuing in 3.6; on 3.4 I think it either worked not so great or not at all, I can't quite remember. But least-priority queuing would be your best bet. [Question] Yeah, so if I
had to pick, I would probably keep it, because I actually feel it's a hard problem that's been well solved, and I have the utmost respect for the guys at Adaptec and LSI; they know what they're doing and are really good at it. I think a lot of this is driven by people wanting more metrics about what the individual drives are doing, so I think the message to the hardware RAID vendors out there is: expose a lot more metrics, because there are power users who want to know things like latencies to individual drives, what the drives are doing, all that kind of stuff. I think some of that is really driving it from an engineering standpoint. [Question] Yeah, so the throttling feature we're putting out is designed to handle exactly those situations. Now, this is brand new even to our team. Typically, in the past, if we saw a really heavy-hitting customer, we'd build them their own cell and move them there. In the new world, to really help us scale, we need all the cells looking identical. If a customer is truly big enough, a multi-petabyte kind of use case, they may get their own cell, but it's by convention: if you look at our config layouts, for example, it's going to look like any other config. So with throttling, what we'll do is go to a customer and say, "Hey, how many FOPs do you need?" Odds are they're going to say, "What the hell's a FOP? I have no idea how many I need." So what we may do is say, okay, run with your workload; we'll then track in their namespace how many FOPs we see, and then say, "That's great, this is the cap we're going to place you at before we shunt you to that least-priority queue I just talked about." And in some cases they may just need to change their workload, because in a lot of cases you see people who are
doing, like, a 4K read, and you ask, "Why are you doing a 4K read?" and they say, "I don't know," and we say, "Well, go look at your code," because sometimes they're so far abstracted behind various libraries that they have no idea what that read actually looks like by the time it gets down to the syscall. All right, I think I'm out of time, but you can pull me aside and I'll be happy to answer any questions. Thanks. [New speaker] Test, test. Are you able to hear me? Oh, it's loud? Is it okay? All right, thanks. Excuse me, sir... yeah, we had one spare connector, so it's all set now. All right, thank you. Should we get started? Yeah, 4:30. All right, good evening everyone, and thanks a lot for coming to the presentation. This is my first time at SCALE, and I've been having a blast here; the organizers as well as the community have made this event such a huge and successful one. Thanks again for coming. As you can tell from my accent, and just by looking at me, I'm actually from Sweden... just kidding, I'm actually from India. Let me take you back, a long time back actually, to when I did my undergrad in India. I still remember that I used to love going to the computer labs; that was my passion. But unfortunately, at my university, my class was allowed into the lab only on Tuesdays between 4 p.m. and 6 p.m. Things are a lot better now, but back then that was the case, and I had to find a way to make my way into the lab at that time, get a computer, do whatever I wanted to do, and then come out. It was very constraining, but at the same time I learned how to make it work in that environment. Then I moved on to the U.S.,
where I wanted to do my master's. Here, the first thing I did at the registration desk was ask about the lab schedule. Registration day was coincidentally a Tuesday, and they said, "Okay, here you go, here are the keys to the lab." I was shocked. Really? So not only do I have to come on Tuesdays between 4 p.m. and 6 p.m., I have to come here, get the keys every time, and then return them after I use the lab? That's ridiculous; that's worse than the public restrooms at gas stations. That's how I was thinking. And then the lady clarified: "No, no, you get to keep the keys. You can use the lab anytime you want; this is just your facility." I was like, wow, this is awesome, this is dope, y'all, peace! Actually, I said it more like, "Thank you, madam, really appreciate it." But you get the point. I collected the keys, I was super happy, doing my happy walk to class, and I was thinking: in India, the safest place to sit in a classroom is the last row, next to the back wall, so that you don't see the professor and the professor doesn't see you. It's a good setup that way; it's like lines marking territories, if you will. Fair: you don't ask the professor any questions, the professor doesn't bother you, you just get your degree done and get the heck out. All done. But then I stepped into the U.S. classroom and looked for the last row, and it happened to be the highest row in the entire room; the room is kind of elevated. I was like, what's going on here? This is like a lion from the zoo getting exposed to the forest animals. I remember thinking, the very first day: I'm screwed. So basically, I had figured out a way to learn about learning in India, and when I came to the U.S.,
I had to unlearn everything I had learned about learning, and then relearn the ways to learn here. Given that I used the word "learn" like 50 times in the last 10 seconds, you've probably figured out I'm going to talk about lessons learned. Yes: my name is Victor Gajendran, I'm from the Ticketmaster team, and I'm here to talk about lessons learned doing DevOps at scale, at SCALE. That was a complete coincidence. Anyway, as soon as I mention Ticketmaster to people, there's a popular question that comes up, so let me throw it out there and clear the air: I'm sorry, I cannot get you free tickets. Terribly sorry about that. Now that we've got that out of the way, let's talk about what I'm going to cover in this presentation. Obviously I'm going to set the stage: give you some idea about why Ticketmaster is special, what our environment is, where we come from, and all that. Then I'm going to take you through our DevOps performance, and finally I'll take you backstage to show some lessons learned, things we picked up along the way. Let me pause at this moment and say a little bit about what this talk is not about. Let me show a little bit of code... I'm showing this code just to show that I'm not going to show code in this presentation. Trust me, I've looked for programmatic ways of solving DevOps at enterprise scale, and there aren't any. So if you're looking to start a great open source project to address a market segment that has not been fulfilled yet, an untapped opportunity, go for a programmatic solution to DevOps for the enterprise... and it's a project that will not get you anywhere. Sorry, give me one second, the slides are moving automatically; let me fix this. All right, it should be back now. So let's get going. Who are we? We are Ticketmaster. As soon as you hear the
name, you think of tickets: these are the people that sell tickets, right? But what we actually do is connect fans to great experiences; we bring them together, and that's our specialty. What it takes is a deeper understanding of what fans' needs, preferences, and profiles are, and matching them to the right opportunity. So let's look at the ticket-buying experience even 10 years back. Some friend in your neighborhood probably told you, "Hey, there's a big show coming up." You wake up that morning, the box office opens around 10 o'clock, so you go stand in line. As you walk into the place with all the counters, your mind makes split-second decisions: okay, there are 50 people in each line, this line moves at this rate, so I'm going to get my ticket faster if I go to line number three. Your mind makes these complicated decisions very quickly, and you pick a line using the probability expectation that you'll get your ticket faster than the other guy. As you're in this line and it progresses slowly, you start to wonder: am I going to get a ticket when I reach the front of the line? You're anxious, you're just waiting. As you get closer to the counter, the guy in front of you picks up his cell phone and starts talking, and you wonder, just get out already, get out of my way! Then he goes to the counter, buys his ticket, and at some point picks up his big bag, pulls out his checkbook, and starts asking, "Who should I address this check to? And can you spell that for me?" Oh, come on, just get out already. Please don't judge me; I think all of you have been in that spot, with somebody at the front of the line just blocking traffic. Now just compare that experience to what it is right now with Ticketmaster.
Now you don't even have to worry about which shows to go to; in fact, Ticketmaster makes those recommendations for you. You can buy at the convenience of your home, and they're verified tickets, meaning if you buy a ticket from Ticketmaster, it's guaranteed that you'll enter the venue. And once you're in the venue, we also sell technology solutions like in-venue ordering: you can order food and beverages from your seat, to be delivered right to your seat. So we have a wide range of technology and products to offer, and that's what makes our company unique. We've been doing this for over 40 years, so you could think of us as not just Ticketmaster; a fairer name, I would say, is the experience masters. All right, our technology view. We have a wide range of technologies; this is a company that's been around for 40 years, like I said. We even have VAX computers... actually not the physical ones, it looks like they don't sell them anymore, but we have VAX emulators running on Linux machines, on which a small piece of our software runs. That's one end of the spectrum; on the other side we have modern containers, Docker, we use configuration management with Chef, and we've started looking into Kubernetes. So there's a wide range of technologies, and pretty much everything in the middle as well. One thing to note: Ticketmaster has been the leader in the ticketing industry primarily because all this technological innovation has driven its business innovation. We are big fans of technology, and there is a lot of investment that goes into building great technical products at Ticketmaster. We have about 20,000-plus virtual machines, and a wholesale-warehouse-type massive data center that hosts all of them. So one might wonder: how do we operate this type of massive
infrastructure at scale without automation? Obviously, we built our own. Back in 2002, before Chef and the like existed, our engineers got together and built an internal configuration management solution that is still being used in some places in the company. Some of our engineers went to Chef training a couple of days back, and I saw a post in Ticketmaster's internal Slack channel saying, "Hey, the way Chef does configuration is very similar to our internal tool," and that speaks a little to the caliber of engineers we had back then. We also had another set of engineers who tuned the kernel of the operating system, primarily Linux. Now you might ask: why in their right mind would somebody go and tune open source code? It forks you off from upstream, so you won't get any support, and because it's a special use case you can't give it back to the community and get it supported henceforth. But there is a specific reason why this happened, and it has to do with our traffic pattern. The website, ticketmaster.com, has heavy demand at times: we go from zero to 40 million transactions in a matter of minutes. This is a very unique property of Ticketmaster, because there are lots of companies out there with a much broader load on average, but this type of spiky traffic at a given moment is kind of unique to us. If we were in the auto industry, we'd be beating the Bugatti Veyron, which does zero to 60 in about 2.5 seconds; we'd be faster than that. Sometimes our traffic pattern is so spiky that we thought we were being DDoSed by hackers from some university, but it turned out to be our customer traffic. There are two factors that explain why we have such a traffic pattern. Number one, obviously, is that we have 54 million unique monthly visitors coming to our
website. These are loyal fans who keep coming back to us to buy tickets. And then we have something called onsale events. What's an onsale? A little refresher on the ticketing industry: the way ticket sales work is that an artist announces a show and the inventory is made available, typically at 10 o'clock Eastern time, to the people in that region; then it opens up at 10 o'clock Central time for that group of people, and then Mountain and Pacific time at 10. So from the west coast perspective, you start seeing peak loads at 7 a.m. our time, then 8, 9, and 10 a.m. That's why you see huge traffic spikes. Basically, think of it this way: onsales are events where tickets go on sale to a broad group of the population, and the venues are usually small, relatively, for the demand some of these artists see. So you have a lot of people competing for a small set of tickets, and that's what causes this kind of demand spike, this crazy all-at-the-same-time influx of traffic. Also, most of the key onsale events have happened on Fridays; in other words, every Friday is pretty much Thanksgiving Black Friday for us, except you don't have to fight over a $50 TV, you can do it from the convenience of your couch. From our perspective, though, it's lots of people coming to buy our inventory all at the same time, and that results in a traffic pattern that looks something like this. This is an OPNET graph from one of our real onsale events, which happened a couple of weeks back. As you can see, up until about 6:58 a.m. there's not much traffic, but at 7 a.m. you see a massive spike; it rises like it's nobody's business and then comes down a little more gradually as we serve the requests. This is one of the reasons why our
engineers in the past had to go down to the lower levels of the OSI stack and start tuning, so we could process the incoming ticket traffic faster.

All right, so at Ticketmaster we have around 150 products of different kinds, and something like 400-plus APIs or service groupings. This is pretty much the tech stack. I couldn't include everything; it's already crowded. Hopefully the people in the back can see all the letters. So let me start reading them one by one... actually, I'm just kidding, we're not going to go through this one by one. The goal of this slide is just to show that we have a lot of technology, it's a complicated environment, and we are primarily an open source shop. We encourage our staff to use open source technologies first, because we believe a corporation cannot solve a problem the way the broader community can, or keep pace with it; our resources are limited. So we try the open source technologies first, see if they work, and we also encourage our engineers to contribute back. In fact, we have engineers on our team who participated in developing Chef, and there's an engineer from the Java Spring team, and so forth. Hopefully this gives you an idea of what Ticketmaster is and what some of the attributes of our technical environment are.

All right, now that the stage is set, let's get into the actual show. The DevOps performance I want to walk you through is a journey that Ticketmaster took. I'm going to describe it as starting at an inception point A, and then I'll walk you through a few pivots, B, C, and D, that drastically changed how we progressed along this DevOps journey, a journey of seeking true north. "Pivot" is actually a Lean Startup term; we won't get into too much detail on Lean Startup principles, but I'm happy to take questions. A pivot is basically a structured course correction designed to test a business hypothesis or assumption. This journey is not drawn to scale; it's just to show how we progressed along our DevOps maturity model. Point A is roughly 10 years back, and I'll bring you up to where we stand today.

So the journey started at A. It was pretty flat, and then it went through a soap-opera-style arc: lots of ups and downs, happiness and sadness along the path. Then we reached the peak of confusion, where we thought things worked, and it turned out they didn't work very well. And obviously, I'm from Bollywood, so I'll definitely end with a happy smile and a dance. So in this presentation, as you've probably figured out, it's not really about the journey or how it started sad and ended happy; it's about what we picked up along the way that could help you take it to your environment and do something with it. Hopefully I can share something meaningful with you today.

Okay, so let's break this journey down into four phases. The first one is pre-DevOps, and I've given names to the others as well; we go from red to green, from bad to good. Like I said, the journey's point A started roughly ten years back, and as I described, we had a Chef-like internal configuration management solution, we had kernel tuning, and we had smart engineers who could solve any complicated problem. Well, we started building all these internal tools and fell in love with what we built rather than with the problem we were trying to solve. The products we created became our babies, and we started caring for them more. So we started stagnating: we fell victim to our own success. As in many industries, even in technology, if you are not rising, you are falling. We stopped enlisting and modernizing our technology, and therefore we started falling, and that explains the direction of the arrow you see here.

In the pre-DevOps era there are several attributes particular to this phase. I'm sure most of you know these, so let me quickly skim through them. Like any dysfunctional organization that does not build products collaboratively, we had Dev and Ops building a tall wall of confusion between them. The developers would finish their product and throw it over to Ops with some labeling and a runbook, and the Ops guys would look at it and go, "that doesn't make any sense, let me throw it back to them, because the runbook doesn't have the proper punctuation marks." So they'd send it back to the Devs, and they'd keep finger-pointing at each other. This results in a lot of wasted cycles and a lot of rework, and eventually you don't end up producing quality products quickly at a reasonable cost. That was one.

Then we had tech debt. We're a decades-old technology company with all kinds of versions of technology, so we obviously have tech debt. Like any other debt, it either gets paid down or it grows; if you don't invest actively in cleaning up your tech debt, it builds up and exponentially impacts your productivity down the line. That's what happened to us as well. And obviously we had long cycle times, meaning the time from code commit, or even requirements definition, all the way to a customer seeing that feature in production was getting longer and longer.

If an environment is characterized by these three things, then obviously what's going to happen is we're going to get our butts kicked by somebody, right? For Ticketmaster, that's exactly what happened. We did get our butts kicked, but this time by a bunch of nimble, agile, small startups. These companies started coming out of nowhere and eating our market share away from us, and Ticketmaster couldn't compete with the
pace at which the features were getting released by these small companies. We were starting to get concerned about that from a business perspective, and that's when we reached point B, the pivot B. In this pit of sorrow we had to make a decision. It was a sink-or-swim moment: either you change your way of operating and get software out there quickly that meets customer needs, or you continue to lose your market share. Ticketmaster has been the leader in the industry, so that was something very tough to swallow.

Okay, so we attempted DevOps, and a few examples of our first attempt are shown here. We all know that multiple silos are a bad thing, so we said, let's build a cross-functional team where the developers, the testers, and the operations folks get together with a common goal: quickly get the product out to production and keep it running there so it just works. Great.

The next one: I talked about cycle times. One of the deeper problems with cycle time was that we were not taking any feedback from our internal customers, our development teams. It's not that we didn't have a build pipeline or automation to reduce cycle time; it's that the automation we put in place did not meet the development and delivery teams' needs. When they came to us, we'd say: we already know what you guys want, because we've done this so many times at previous companies, so we know exactly what you need and we're going to build it. Obviously they did not adopt the solution, and we continued to have poor cycle times. Some folks even quoted Steve Jobs, that customers don't really understand what they want. As I present these things, I also want to caution you that these solutions, even though they worked for us, may not work exactly for you. You might want to take these findings, understand your environment, and make them your own; there's going to be some customization required.

All right, so we made this feedback problem disappear. We started listening to the development and delivery teams and got their perspectives, and this way we were able to drive adoption of the build pipeline automation we had put in place, and we started to reduce cycle time. The results are shown in the DevOps journey tracker I have here: we started making good progress on cycle time and even on our customers' NPS scores. This is what I would call the functional domain, meaning we are able to operate in a sustainable manner and deliver products as planned, well, almost as planned, but functional. A lot of companies are stuck in this place; it's okay, but it's not quite good.

But then at B-dash, the point I've marked in the middle, something started to happen. Our performance started to degrade, we were not that productive anymore, and we were confused as to why that was happening. So we dug a lot deeper, conducted some root cause analysis, and figured out that things we thought were good solutions had turned out to be bad. How is that? Well, we talked about cross-functional teams, right? If you know Ticketmaster, it's a very siloed organization: we have massive teams broken down into specializations. Let me list all the teams in one breath and see if that's possible: we used to have a DC ops team, data center team, cloud team, infrastructure development team, a platform team, development team, QA team, business analyst team, and so on and so forth. All these teams are specialized in their areas and don't know about the technical difficulties of the other domains. Now you have to form a cross-functional team. How do you go about doing that? We picked one person from each team and said, okay, you guys work together, you're the cross-functional team. And then, for redundancy, in case somebody gets sick or takes a vacation, we doubled the numbers, just for high availability. The team size became massive. Jeff Bezos's rule is the two-pizza team; ours was more like a ten-pizza team, and extra-large pizzas at that. This wasn't going to work. The team was so big that members didn't know each other's names; it took a couple of weeks just to get familiar with people's names. The funny thing is, the card wall was so long we used to call it the Great Wall of Cards. Anyway, it started to get chaotic and dysfunctional, so we had to break the teams down into smaller teams. We experimented with a couple of models. We tried the Scrum of Scrums model, we tried independent parallel teams, and we settled on something in the middle: the teams were broken up into an infrastructure service, which serves the next team, platform-as-a-service, which adds its component, and then it goes on to the product delivery teams. Between these teams there were unwritten contracts: I promise to deliver this feature to you, or I'll give you this code base with this functionality so you can build against this interface. We looked at microservices architecture and the best design patterns out there to learn from, for example Tolerant Reader and consumer-driven contracts from the microservices domain, and we adopted the same ideas for team design. So we course-corrected, and things started to get a little better.

Then the next one: listening to perspectives. What's wrong with listening to different perspectives? Everybody should do that, right? But we started listening so much that we kind of became
DevOps psychologists, if you will. We listened too much: we wanted to get everybody's perspective and solve the problem for everybody, and we soon figured out that a single solution is not going to meet every need. People have different needs. So we said, okay, let's launch an initiative to really understand our customers better. You cannot have 10 different customers who are all exactly equally important to you; it's just not realistic. A deeper understanding of your customer segments and their needs helps you prioritize how important each customer is, and then, within each customer, walk down the list again: there are different requirements with different priorities, things that solve big problems for the customer and things they would just like to have. We had to break it down so that we produced incremental product feature releases that really help, and a deeper customer understanding helped us with this.

All right, at this point we reached point C, pivot C. Pivot C is characterized by being, I would say, DevOps-y, in the sense that most companies try this by reading quick books like DevOps for Dummies or Get DevOps in 72 Hours. We bought all those books, read all those quick tricks, and harvested all the low-hanging fruit out there: we had cross-functional teams, we started using CI, the basic things. We saw results, but at the same time we were not feeling well at the core. We were wondering: how is this going to turn around a 40-year-old company? How is this going to make it nimble, so we can deliver products quickly the way a startup would? This concern started to rise, and we knew more change was required for us to launch into the contributing phase. What is that, and how is it different from the functional phase? The contributing phase is where you not only operate and deliver products per plan and within budget, but you also understand and think from the customer's value perspective, how much value you're adding to the customer, and deliver accordingly. This type of value-oriented thinking needed a much deeper, broader change.

Okay, so change is required, but how much change? Well, it's New Year's time and we all make resolutions, so let's talk about resolutions for a moment. Apparently, in a recent survey, health and fitness top the list; that's the category where most of us make our resolutions, followed by career growth. With this picture up there, I'm not going to talk about career growth, so let's talk about health and fitness. We all know that eating an apple a day is going to give us good health, it keeps the doctor away and whatever, but is it going to produce the physical transformation you expect, if that's what you're looking for? I was talking to my friend, and he said, okay, if I let go of my other six-pack every Friday, will I get this six-pack? Not quite. You have to change your eating habits, your sleeping habits, you have to exercise regularly, you have to start counting calories and macros and whatnot. The entire lifestyle has to change; it has to be a much deeper and broader change. So the point here is: minor tweaks don't produce major results. They really don't. If you want something big, you have to go big.

So we decided we wanted a deeper cultural and organizational change, and that's when we discovered the culture of promises. What is a culture of promises? Let me break it down first and then give an example. Every company has a vision. You break that vision down into strategic promises, each promise gets broken down further into functional promises, and when it reaches the teams it gets broken down into tactical promises.

Let's look at an example. Say our vision is to connect lots of fans to great experiences. A strategic promise that maps to that is something like: I promise to make Ticketmaster.com the favorite ticketing site for fans. You could break that down into smaller promises, like: I promise to provide the best online and mobile user experience for our customers. That could be further broken down into tactical ones, like: I promise to develop interactive seat maps, a feature that lets you view the stage from each seat in the auditorium. And there could be operational promises too, like: I promise to build a site that simply always works, with high uptime.

So now you see that we have taken one top-level vision and broken it down into multiple hierarchical, smaller, more actionable promises. There are several benefits to this. First, it aligns the entire organization toward a common goal at the top, so if the CEO and the senior leadership team decide to change direction next quarter or next year, it's easier to make the change at the top and let it trickle down. Everybody is now thinking about value and the customer more than they are thinking about their products, their code, or their process. Everybody is metrics-driven in this framework. And, more importantly, you'll have seen this in studies: the best way to improve team morale is to make people feel like they're part of something bigger. This framework helps team members feel that way. Even the person setting up virtual machines, just responding to VM tickets in his queue, can now think: this virtual machine is going to help deliver the interactive seat map feature, which ties to the best online experience, which is going to tie to the
promise of making Ticketmaster.com the favorite site. The promises roll up, and every task is now more meaningful.

So we tried this and got excellent results. Turnaround times were faster, and we not only delivered products as promised but also understood whether they really helped the customers; business value delivery was happening on a continuous basis. Then we reached point D. Point D is not necessarily a pivot; it's an accelerator that brought us into the next phase, which is transformational. In this phase we were so tuned to thinking with the customer at the center, not with our product, code, or process at the center of the ecosystem, that we could quickly pivot if we needed to, based on a strategic company decision, and still deliver on time to diverse customer needs. We were able to own the customer's problem and always be keen on solving the customer's problem rather than on the products we built. The term "product manager" is a misnomer, as some of us know: if you call somebody a product owner, after a while they start loving their product and what they created. But if you call them the owner of the customer's problem, and the goal is to solve that problem for the customer, the focus is different. Now they put the customer at the center of the equation and produce whatever products it takes, version one, two, three, whatever, to solve the customer's problem. This is a mind shift. Because our organization was already set up for this structure, having practiced the culture of promises for a while, we started seeing some cool transformations that were organic. Let me give you a few examples.

We made a promise to free developers from the constraints of on-prem infrastructure. They have a need, they submit a ticket, they wait for X number of days depending on the SLA, and then they get a server, but it's not connected properly, and they have to restart the whole process; it causes too many delays. We don't want to constrain the developers. So the team came up with this idea: why bother with this massive infrastructure and all the constraints here? Why not just go to AWS, Amazon Web Services? That's exactly the recommendation that came out of the team. That is groundbreaking for Ticketmaster, because, as I mentioned, we have a massive infrastructure and a big team that runs it, and for them to accept that we're going to go to the cloud because it delivers better value to the customer, that is a huge shift.

The next promise I want to highlight is reducing the cycle time of software releases. We said: we'll build a continuous delivery pipeline. As you know, a continuous delivery pipeline has several stages: you check in the code, it runs the unit tests, then the integration tests, then it spins up an environment using scripted automation, runs the user acceptance tests, and then goes on to further performance testing or security scanning or whatever, and the code ends up in production, all in a fully automated fashion. It's the same series of steps that takes your code through, every time. Now, we built, actually we are in the process of building, a framework that lets people plug their own custom build scripts and their own unit test frameworks into that pipeline, while we still keep one high-level, unified continuous delivery pipeline for the entire set of 150 products. What does this offer? You now have a common scale with which to drive tech maturity. How's that? Let's say all the teams are using the same continuous delivery pipeline. Team A might have 95 percent coverage in the unit test stage, and you can measure that because it's the same standard pipeline, while team B has only 45 percent. You can take metrics from the various stages and throw up a dashboard that says team A has a higher level of maturity in practicing continuous delivery: their coverage is better, they have more automation, their environment creation is scripted, their cycle time is best, and so forth. You can use that as a model to motivate the rest of the teams and drive everybody toward a higher level of tech maturity.

The next one: we promised to eliminate environment inconsistencies. Docker had come up, so we wanted to containerize our applications. Besides the obvious benefit of eliminating environment inconsistencies, we got another unique benefit. Remember we talked about migrating our applications to AWS? The team not only proposed that; they came up with the idea of creating a cloud enablement team that builds a self-service AWS onboarding kit. It includes things like CloudFormation templates, the VPC layout with all the connectivity established, and proper recommendations for technology choices within AWS. It's a package that's easily consumable by the product teams: each team can just go through it and get their product up in the cloud, which lets us move to AWS faster. On top of that, if we Dockerize the applications on the on-prem infrastructure while they're running in our private cloud, it's much easier to migrate them to the public cloud. So Docker was another beautiful transformation that we recently figured out.

So this pretty much sums up our journey. As you can see, we have come all the way from the pre-DevOps domain to a transformational phase where we are willing to see
past the technologies we use and look at what's most important at the end of the day, which is customers and customer value. So that's it; that's pretty much our journey. If this were a Bollywood movie, this is where they would roll the credits and sing the happy song and dance, but instead I'm going to summarize the key principles for you to take with you. There are two things, mainly. The first is that automation drives throughput: you don't want manual operations in your day-to-day work, and the things that repeat most often are obviously the ones you want automated, because that's where you get the best bang for the buck. So drive for as much automation as possible. But if you produce crap, your automation just produces more crap, so you need the feedback loop: feedback improves the quality of your throughput.

There are a couple of examples I want to give here. The first is that seeking feedback improves empathy, helping the dev and ops teams understand each other's perspectives. If dev teams understood how difficult it is to maintain and operate software after it goes to production, if they had that empathy, they would build applications that are highly available, easier to operate, easier to back up, easier to monitor, and so on. So asking for feedback improves empathy. The other is that we should always use operations as an input to close the loop back into strategy. For example, when product managers make a business case for building a feature, a sub-product, or even a big product, they make a lot of assumptions. They say: if I build this feature, customers will come to the website and click this button so many times, and therefore I expect so many dollars to come out of it every quarter. They make these assumptions at the beginning of the project, we get funding approved, we build the product, and the product goes to production. But very few companies actually take real measurements to validate the assumptions made in the ROI phase and feed them back to product management. Product management pretty much runs blind, relying on quarterly finance numbers to find out, much later, whether the assumptions they made in the first place were right. Operations teams can help close that loop.

Okay, how many of you know this guy? This is Alvin Toffler. He's a famous American writer and futurist; he's written a lot about revolutions in technology and communications. He's the guy who said change is not a mere necessity of life; change is life. But the thing that caught my attention was this one: the illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn. What it means is that this is an age where the smartest guy or the loudest guy in the room is not going to win; it's the ones who can quickly learn and adapt to the changes they see who are going to win. To sum it up, when you go back, please encourage and try to create a culture of learning where every team feels it's safe to make honest mistakes as they push the boundaries of technology. Because when the environment is not safe or suitable for learning, engineers start repeating the same thing they did yesterday because it's safer, and you don't get much learning in that process. I hope I answered some of your questions. Thank you, and I'd be happy to take more questions. Go ahead.

Yeah, that's a great question; let me repeat it for the recording. The question is: you showed a wide range of technologies in the tech stack, and as you do automation, are you trying to work toward converging that tech stack? Obviously yes, because we do find it very wasteful to have duplicate
technologies for example you don't want chef and puppet both running in the environment right we have an internal config management tool like I mentioned and now we are starting to adopt chef so one is going to replace the other obviously we're going to use open source tools so so that's the direction we are going but it takes a lot of commitment and because we have 150 products or so it takes a while to get rid of legacy technologies and replace them with modern ones but that's the direction yeah these cross functional teams usually are self organizing and they usually drive the technological decision making that is we don't practice the chief architect that reviews every single design and approves the design we don't practice that anymore instead we have communities of practices meaning the architects who are interested in a particular topic they get together on a regular basis and they discuss what's the best practice what's the modern ways to do this and they talk peer to peer and make decisions most of the time all the technical evaluation as well as selection is done by the corresponding teams yeah please so so there are several different frameworks that enable one to implement the cultural promises one of the frameworks that we adopted is ThoughtWorks Lean Value Tree Framework you can look it up online it basically goes around by breaking down the vision into strategic promises and exactly the way I showed but they have constructs and proper artifacts in place that helps you and they have a step by step of rolling out such a program into your company we had to train a lot of people on what Lean Value Tree is and we got a lot of pushback to be honest with you not everybody is gonna you're not gonna have the same 100% team past this cultural rollout there's gonna be some resistance there's gonna be some churn you gotta plan for that and be more realistic yeah yeah so so there are two questions that you asked the first one is did you just go across and like 
instead of one team did you go deeper into a product or did you just say everybody just go ahead and use this that's one the second one is a question actually the second question let me ask you again once I answer the first one it depends on the solution we were trying to solve like for AWS migration as one example we want all the products to migrate to the cloud and we have an aggressive timeline to make that happen so we created this cloud enablement team that creates this onboarding self-service package so how do we know this package is gonna work so we partnered with one of development team whose technology is suitable for to be the first ones and we might we help them migrate we are helping them migrate to the cloud as we are migrating them we're creating this self-service package which is our product now we're gonna go take another product on the other side of the spectrum and test the same self-service toolkit at that point we know that okay two different use cases have worked so I'm gonna now open the gates and say okay product team just go it helps you just self-migrate your own product see you in the cloud so it depends on what actually we're trying to roll out we've tried different things and this is what seemed to work for this occasion do you please repeat your second question um if you remember the the page where I slide where I showed the hierarchy of promises note that I didn't start it off on the bottom I went from the top and that is that is how the model works like you cannot have a bunch of teams trying to drive this cultural change upward it's gonna be extremely painful and won't even go anywhere and we have 80 plus development teams so it's not gonna go much further our change fortunately happened from the top so it was the team of senior leadership executives sitting in the room wondering okay these minor changes minor tweaks that we are doing is not gonna give us that much results or that much cultural change let's do something big let's go 
all in and that's when we brought in the lean value tree framework so fortunately it's from top down yeah yeah exactly precisely yeah yeah it needed a bunch of these little school of fishes out there trying to eat the shark we needed that kick in the back to like get something good going please yeah so very good question how did you define the KPIs and how do you go about like tracking and making sure it's in the process you're constantly looking at it like I mentioned the lean value tree framework does exactly that in fact we have several artifacts that we identify the goal leads when I say these are different the strategic promise leads the functional promise leads and the tackling promise leads they are supposed to fill a template like this is a framework on how you how you roll out right that template has one of the entries in there is KPIs success factors right so the team leads they have to think through okay how am I going to deliver this value and then they have to come up with ways to measure it every cycle that we review that that particular goal or the project we will review the actual metrics and we have to show the teams are supposed to show the trend or the numerical change and progression it's okay if I mean it's not possible it's possible to expect the graph to be always like a hockey curve graph let everything is improving it's not the goal of looking at the metrics is mainly to find out whether the methodology is working or not it's not really to make sure that it's working so if it doesn't work totally fine you just move on to the next idea and see if that works that's the culture of learning I was talking about currently we don't have any special tools that captures this in fact this is not this is a new framework so there's not many out-of-the-box tools out there that helps you do it we are currently just using Excel at this point which is works sorry oh absolutely yeah so A is like 10 years back right and then B started roughly three years 
back, and then you could think of it as one year, one year, one year, right? So A to B is the one where, like I showed in the graph, things were dropping, but it was not that fast; it was more gradual. In fact, those are the tough ones to track, because every day you're looking in the mirror and you didn't lose that much weight, but then if you look at your picture from two years back, it's like, what? You don't notice the gradual changes as obviously as you do the drastic changes. All right, thanks a lot for coming.

Hello, I guess I'll get started here. Okay, well, thank you for coming. I know some of you are hungry and thirsty, so I won't take up much of your time. I'm not a technical expert, so if you do have any technical questions, I can take them down and get you to the right person, or get an answer back to you. Okay, so my name is Douglas DeMaio. I work for openSUSE; I've been there for about a year. I'm fairly new to Linux. I have a background in the military, my wife's German, so I don't have much of a sense of humor, so bear with me. All right, so this is a basic overview of what we'll talk about today with openQA. openSUSE, if you're unfamiliar with it, is a community project that's sponsored by SUSE; it has two distributions, and it has multiple tools. A lot of people will ask, you know, what is the difference between SUSE and openSUSE? openSUSE is the community version and SUSE is the enterprise version. So why automated testing? Well, the answer should be fairly obvious, right? Maybe that's why there aren't too many people in here. But we're talking about innovation, we're talking about progress, we're talking about driving technology forward. This is just a brief summary of the changes that have taken place using openQA in the past two years. You can see how many changes have taken place in SLE 12: you're talking 600,000 packages added and nearly 2,080 removed, right? Same with openSUSE 13.2, very similar numbers there, and then our latest release,
openSUSE Leap 42.1, also with very similar numbers. Some might ask what the difference is in jumping from 13.2 to 42.1; I just want to clarify that, because I'm sure some of you might be wondering why that happened. openSUSE in its latest release used a lot of source code from SUSE Linux Enterprise 12 Service Pack 1; they used that as the base, and then the community built packages on top of that. So it's a little different than CentOS; it's more dynamic, there's more of an effort from the community in that base. And SUSE has a long history of starting something new with 42, a reference to the meaning of life: YaST's first version was 0.42, and the first version of SUSE that was released was technically 4.2. So I hope that answers the question you might have had in your mind. All right, so, yeah, change is good, right? Change is good because you fix security issues, you enhance performance, you fix bugs, you can accommodate newer hardware and newer features. Upstream is very fast-paced, and so you want to be moving just as fast as they are. Change is also bad, right? You get new bugs, you get new security issues, you get new functionality. Some people understand change, some people can adapt, others sort of live in the past. Okay, so automated testing helps openSUSE develop fast, and you will see that in the slides that come a little bit later with Tumbleweed; I'll give you some basic numbers and you can see how fast we're actually moving with upstream. What are the problems with the other testing tools? How many of you are actually using some of the testing tools that are out there right now? Selenium, anyone? Selenium, okay. Cucumber? No. Jenkins? I hear a lot of... okay, Jenkins, a lot of people use that. So what are some of the issues? Well, this is just the basic list, right: some don't render well, some are entirely web interface, some focus strictly on
packages, they lack a GUI. Really, there's nothing out there that compares to openQA. So, openQA to the rescue. A little bit of background: this started out as a Hack Week project in 2009 by a guy by the name of Bernhard Wiedemann, and it developed over time; it was something they continually focused on in the Hack Weeks that followed. So we eventually got to a point where we could start to integrate it, and when we looked at integrating it, you know, this is sort of the DevOps issue and discussion the previous speaker in here was talking about: it's really a cultural change, right? At least with openSUSE, we looked across the board, we saw how our processes worked, and we thought, well, how can we integrate this? And so that's what we've done: we've taken this tool that we've developed and added it to create and enhance our DevOps. This is sort of an overview. So: testing, testing, testing. You'll see the submission come in, it gets an automatic review, goes through pre-integration tests and manual reviews, and then it goes into what we call Factory, and within Factory you're looking at multiple staging levels of tests. This is where some of our processes have been enhanced. There's a huge flowchart in my co-worker's office, and if you look at it, it doesn't make a lot of sense, I guess, if you don't really know the whole thing, but you have the yes/no answers and it all kind of cycles around; you could be at the start, or you could be at the very end and get a no, and it'll shift all the way back to the very beginning to run through that process again. But as you go through these testing phases where we use openQA, within Factory and certain staging areas, when you get to the final test and everything's worked out and it gets QA approved, you get a release, or an alpha or a beta, whatever point you're at in your development cycle. So a
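The gated flow described above (automatic review, pre-integration tests, manual review, staged testing in Factory, and cycling back to the very beginning on any "no") can be sketched as a simple loop. This is purely an illustration of the idea from the talk: the gate names are paraphrased from the slide description, and none of this is real Open Build Service or openQA code.

```python
# Illustrative sketch of the Factory-style gated flow from the talk:
# a submission must pass every gate, and any "no" sends it back to the
# beginning. Gate names are paraphrased; this is not real OBS/openQA code.

GATES = ["automatic review", "pre-integration tests",
         "manual review", "factory staging", "final QA"]

def process(submission, passes_gate, max_rounds=10):
    """Run a submission through all gates; restart from gate 0 on any failure."""
    for round_no in range(max_rounds):
        for gate in GATES:
            if not passes_gate(submission, gate, round_no):
                break  # a "no" anywhere cycles back to the very beginning
        else:
            return f"{submission}: accepted after {round_no + 1} round(s)"
    return f"{submission}: still in review"

# A submission that fails manual review once, then passes everything:
def passes_gate(sub, gate, round_no):
    return not (gate == "manual review" and round_no == 0)

print(process("fix-libfoo", passes_gate))  # fix-libfoo: accepted after 2 round(s)
```

The point of the sketch is the `break`/`else` pair: one rejection restarts the whole cycle, which matches the "get a no and it shifts all the way back to the beginning" behavior from the flowchart.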
little bit of an overview about openQA: it doesn't really touch the software, it acts like someone is actually doing the testing manually, and this is a great benefit. You're talking about opening up all different programs, you're talking about testing the console, a variety of things; you can create a test with openQA to basically do what you want. So it's installation testing from beginning to end, but when it hits a failure it doesn't need to go forward, right? It just stops right there. However, if you wanted to, in the scripting you could ask it to continue. It gives you a graphical interface; you can see on the left there, the green means it's passed, and it moves on to the next level, right? One thing I would point out with this: in your development process, when you have something like our rolling release Tumbleweed, you get a lot of version changes, right? So you might wonder, how do you account for that in your testing scenarios? Basically, you have these green boxes where you can focus on certain areas and say, hey, I need this to be a hundred percent, this area cannot change; however, for version numbers that might change, you can give a percentage of acceptance of change. So you can account for those little things that would happen, and obviously if you're doing routine cycles, those little things would matter, and so that was thought of. All right, so you can see the script running too; that's another good thing. If you go to the openQA website, you can track a live test that's taking place: you'll see a little yellow button, and if you click on that you can actually see it moving, see the test taking place, and see the script below it and how it's actually being tested. Yeah, move on. So this screen, as I said, I'm not a technician, so some of
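The "percentage of acceptance of change" idea above can be sketched as a fuzzy screen comparison. In real openQA, needles are JSON area definitions matched against screenshots with a per-area match level; the Python below is only a conceptual illustration of that idea, and every name in it is made up.

```python
# Conceptual sketch of openQA-style "needle" matching (hypothetical code,
# not the real openQA implementation): each needle area carries its own
# acceptance threshold, so areas containing version strings can tolerate
# some change while fixed UI areas must match 100%.

def match_ratio(expected, actual):
    """Fraction of pixels that are identical between two same-sized regions."""
    assert len(expected) == len(actual)
    same = sum(1 for e, a in zip(expected, actual) if e == a)
    return same / len(expected)

def needle_matches(areas, screen):
    """A needle passes only if every area meets its own acceptance threshold."""
    return all(
        match_ratio(area["pixels"], screen[area["name"]]) >= area["match"]
        for area in areas
    )

# One area must match 100% (UI layout), one tolerates 80% (version string).
areas = [
    {"name": "install_button", "pixels": [1, 2, 3, 4], "match": 1.0},
    {"name": "version_label",  "pixels": [9, 9, 9, 9, 9], "match": 0.8},
]
screen = {
    "install_button": [1, 2, 3, 4],     # unchanged -> ratio 1.0
    "version_label":  [9, 9, 9, 9, 7],  # one pixel changed -> ratio 0.8
}
print(needle_matches(areas, screen))  # True: both areas meet their thresholds
```

Tightening the version label's threshold to, say, 0.9 would make the same screen fail, which is exactly the knob the speaker describes for routine version bumps.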
you you might understand what this is but this is supposed to be from what I understand an output of the previous screen so I'll give you a chance to look at that move forward so this is sort of the architecture the way it's set up we have various workers goes into a pool and it gets I guess it's call up and and sort of waits in line because you'll have multiple tests taking place and they're happening every day I mean there's there's tests every day going on over the over the day over the night and they all kind of wait in a queue and then eventually when it's their time they they go through testing and you can you can adjust that too to if you need something that you need to rush through you can I guess subvert the other ones and and go to the top of the list if that is what you need to do so it gives you a quick emulator it it supports so with ours it supports multiple architectures which is is very helpful for a lot of people right because because they're working on on several different architectures or we're working with so many people so we we have to make sure that our stuff is running on a variety of variety of hardware and it does run on bare metal so most of it is done tested most of it is tested in a VM but you can do it on on bare metal this is sort of the dashboard that you have from it where you can see all your tests and it basically this is in the development cycle of of a distribution you could you could see that I guess years past right you get six month development cycle fairly fairly quick at that time I guess but you might have introduced something into it as you're building as you're building your distribution and you're toward the end of of it and and then you introduce something else in there and then you find some bug or some issue and and and in the past it was kind of difficult because you go where did this happen you know it might have happened like four months ago it if you got a track and go back to all that well the benefit is you 
know, you get to keep all this information on your servers; you have all that data, you can go back and figure it out fairly quickly if you need to. But with a lot of the testing you'll see the issues right away: there's a failure you can see right there on the left, you'll get that red mark, and basically it stops from there. So who's using openQA? Well, obviously openSUSE is using it; we use it for Tumbleweed. How many people here are familiar with Tumbleweed? How many people here are familiar with Rawhide? Okay. So Tumbleweed is what some of the people I work with believe is actually the future of Linux distributions, in that the latest software is being tested and coming out very quickly. It's the latest packages, the latest applications; that's what Tumbleweed is. You have the latest and greatest, and it's stable and it's tested, and that's the most important thing I'd like you to take away from the idea of Tumbleweed: it's tested. A lot of people will compare it to Arch, and, you know, Arch is good, but you have to compile Arch, right, and there's a lot of stuff you have to do; with Tumbleweed you don't, you get snapshots, and it's tested and it works. We used it in Leap, and that was quite a difficult thing, right, to bring two source codes together, but we used it for validation testing and system testing. SUSE is using it; they're using it for SUSE Linux Enterprise 12, and they used it for 11, I believe 11.4 was the first one where they actually started using it a little bit. And Red Hat is using it. The person who wrote this presentation, our QA engineer Richard Brown, the openSUSE chairman, one day he was a little bored, and he thought, well, let me see how easy it is to make a test, so he wrote a test for Fedora, and in writing that test he ran it against Fedora, and he didn't do it to, yeah, like,
I guess, what's the word, be confrontational; he didn't do it with any intention like that. He wanted to take a distribution he knew, but not so well, and just write something for it, and he did, and that test found some bugs that Fedora had. Of course he knew the release manager for Fedora, and he said, hey, I just wrote this for you, I don't know if you want to take a look at it, and the release manager was very thankful, and he said, yeah, okay. So they started talking about it, and they started collaborating, and now you have Red Hat using it, and Red Hat is contributing back to it, which is the most important part, right? Because if openSUSE has some issues and Fedora has issues, you know, you have certain bugs that in a lot of ways are related, so we're kind of working together and we're helping each other, and I think that's a positive good-news story for the community and for Linux. You can see right there at the bottom, that's the release manager; I took this screenshot just a couple of hours ago, but they used it five days ago, and they're using it with Rawhide, so you might see that progress develop a little bit further based on how they integrate it into their processes. I don't think they're there yet, but I'm sure they will get there soon. So, openQA and Leap: I briefly explained a little bit about this earlier. You're taking source code from SUSE Linux Enterprise, and then what we had in openSUSE was sort of the development going forward, and really bringing those two together went through extensive integration challenges, right? You're looking at over 25 different installation and upgrade scenarios, and openQA allowed us to do that. You know, with openQA an engineer is twice as productive, right? He's running his automated tests while he's doing his manual testing. openQA is just
really beneficial for everyone, and as I mentioned earlier, we have it in our processes; it is an integral part of openSUSE development, and it's really allowed us to be very fast in our releases. So, the very bottom there: testing, testing, testing. New Tumbleweed releases, as I said, it's a rolling release; they're expected about one every two or three days. When we're building our regular release it slows down a bit, right, because resources are focused in other places, but it still does move forward during that time; it's just that we have to dedicate more resources to the development of our regular release. So here's an example of a quiet week for Tumbleweed; this is something that happened back in April or so. They had three snapshots, 146 package updates, 15 new packages on the DVD, and we're also removing some packages, right, so we're cleaning up after ourselves, and one new kernel, all tested, with 118 different installations and upgrades. This next one is a little more interesting, right, and this is what we're talking about with development: you have five snapshots, 298 package updates, 47 new packages on the DVD, 42 packages removed, two new kernels, and it was sort of just another week, as someone described it when they did this. Again, this was back in spring, but we did the same thing last week: we had five snapshots within Tumbleweed last week. Yes? I'm not exactly sure on that number; I mean, I could explain the top ones, but that one I'm not exactly sure. Do you know, George, do you know anything? So, yeah, as I mentioned earlier, the first time SUSE started using openQA was in SLE 11 SP4, and they integrated it into their later processes with SLE 12, which was released last year, and SLE 12 SP1, which was just recently released. How you incorporate it: you do it with some pre-validation and staging, so
in every part of the development, I guess, you're kind of looking at openQA and using it to your advantage. So it's inside our processes, it's inside SUSE's processes; with SUSE they don't really show their stuff, you know, it's sort of hidden in the back, but with openSUSE we do make those tests available for everyone. So, over 100 different validation scenario tests. I would think that would kind of be along the same lines as your question earlier: we're using it for alphas, betas, different scenarios. As a matter of fact, when we released Leap I think we just went straight to a beta, I think that's what we did, and you can use it for post-validation as well. Go back... So if you're thinking about using openQA, here's some basic information on how you can either contribute, or learn about it, or contact someone who can help. The documentation is available, the bug reporting is available at that website, and the main project is located on GitHub. As I mentioned earlier, openSUSE is transparent; they're making all their tests publicly available, so everyone can see them. The best way to start is to think about how you would describe the steps for someone else to follow. So if you're writing a test, you want to think about, how am I going to use this, right? You want to write those steps down, assume that the person doesn't know how to do anything, and then use those steps to help create your test code: write the test code for each step. You also want to think about what you can do with openQA to enhance or develop your processes to aid a DevOps transformation, if that's something you're here thinking about with this presentation. This is some basic background, sort of, I guess, advisement and whatnot. Yeah, "organisation", it's a British guy who
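The advice above, write the steps down as if for someone who knows nothing, then turn each step into test code, and stop at the first failure, can be sketched as a tiny step runner. Note that real openQA tests are Perl modules using its testapi (functions like `assert_screen` and `send_key`); this Python version is only a conceptual illustration of the methodology, and every name in it is invented.

```python
# Conceptual sketch of the "one written step, one piece of test code"
# advice from the talk. Real openQA tests are Perl testapi modules; this
# Python step-runner only illustrates the structure: run steps in order
# and stop at the first failure, the way an installation test does.

def run_steps(steps):
    """Run (name, callable) steps in order; stop at the first failure."""
    results = []
    for name, step in steps:
        try:
            step()
            results.append((name, "passed"))
        except AssertionError:
            results.append((name, "failed"))
            break  # a failed step means later steps can't be trusted
    return results

def fail():
    raise AssertionError("unexpected screen at the partitioning step")

# Steps written down first as plain instructions, then turned into code:
steps = [
    ("boot installer",   lambda: None),  # placeholder: pretend it succeeds
    ("choose keyboard",  lambda: None),
    ("partition disk",   fail),          # this one fails
    ("install packages", lambda: None),  # never reached
]

for name, outcome in run_steps(steps):
    print(name, outcome)
```

Each written-down instruction maps to exactly one entry in `steps`, which is the habit the speaker recommends before writing any actual test code.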
did this, so there's no "z" in it, yeah; I just left it the way it was, I don't know if changing it was necessary or not. And here are some other coding examples for you, I guess. Here's an example test for the console, and this next one's for the graphical side. So as we're looking at that, we see how openSUSE is using openQA with its distributions, but it has a lot more potential than that, right? Here are some of the things they are actually looking at doing; the thoughts are already there, and they're actually working toward this, and my understanding is that they might possibly be looking at SaltStack too. And that was pretty much all I had for this presentation. If you want to find out more, you can go to that website, and you can contact Richard Brown at suse.com, or hop on the IRC channel, #opensuse-factory, and all those guys will be able to answer any question you have pretty much right away; they're on there all the time, and they'll get back to you with any sort of help you would need. Richard's very helpful; he can answer any questions you have, he does it as his day job. What's that? He's the guy who made the slides, yeah, he's the British guy. The slides? No, I don't think they are; I did send them to SCALE, but I made just a couple of changes, not much; the content, for what he did, was all there. So, does anyone have any questions about openQA? Is it being used to test anything other than the installations? I believe it is; I think that was actually in the slides, let me see if I can find it. I did see that somewhere: you can use it to test network cards, you can use it for basically a lot of different things. OpenCV? Yeah, it uses OpenCV to read the actual screen and output and compare against the predefined needles. That isn't the question? Okay, okay, I couldn't answer that question for you; I wouldn't be able to. Well, I mean, I don't see why
you couldn't make a test for that. I can't tell you if we do that, but judging from what I can see in the documentation and the way they've discussed it, you can test for pretty much anything you want; it's just a matter of actually writing it, right? And that brings up a valid point, if they hadn't thought about that, but I'm sure they probably have; who knows. I can pull up IRC and ask if they would know that; probably everyone on there would. Oh, that's actually a test. Sorry, yeah, that was just something I had running in the room before people showed up, so they could zone out or look at it; there was no intention behind it other than giving them something to look at. Did you want to see it? Yeah, okay, let's see. I don't know if it would be in the docs; again, I'm not a technical expert. I'll go to the documentation and see if there's anything... it should have all the information needed to install and set up the tools; there's a starter guide, general operation; that's probably in the starter guide, I would think. I mean, I really couldn't tell you. I know we have our security team, and I know they look at stuff, but as far as incorporating security tests into it, I don't really see anything; it's an interesting thing to bring up, and I can ask Richard and see. I'll give you my card, you can reach out to me; I mean, that's a good idea, right, why not? I've heard of it, uh-huh, yeah; I'm sure bringing it up would probably make someone do so, yeah. Any other questions? All right, thank you.