So I guess we're going to get started here. Thank you for coming. It's the tail end of the conference; some of us are here tomorrow, but for a lot of you this is probably it, so we're not going to take very long. There are two reasons for that: one, like I said, it's the end of the conference, and two, we did a lightning talk on one of our clouds earlier this week and got several questions, so we want to make sure there's room for enough questions afterwards.

My name is Joe Topjian, and I'm here with Michael Jones, this guy here. We were supposed to have a third speaker, Barton Satchwill, but he couldn't make it, so it's just going to be the two of us. We work for a company called Cybera, located in Calgary, Alberta. In a nutshell, we play around with new technology and try to figure out what it's good for: things like robots, SDN, and clouds, which will be the focus of this talk. That's where we're located on the map. We also maintain the Alberta portion of the Canadian research network. It extends to all parts of Canada, each part maintained by an organization similar to us; the names of those organizations are up there.

So, us and clouds. The first cloud Cybera worked on was called CESWP, the Cloud-Enabled Space Weather Modeling and Data Assimilation Platform. Cybera started with Eucalyptus on that cloud, but they ran into several issues, and OpenStack got started very soon after that, so Cybera jumped to OpenStack. This was before I was actually at Cybera.
So I don't know too much about that cloud; I came to Cybera with DAIR. DAIR is the Digital Accelerator for Innovation and Research. We then created a cloud called the LMC, the Learning Management Cloud. We also have a cloud called the VCL, which actually uses the VCL software from North Carolina State University that was mentioned in the previous session. And the latest cloud we've launched is RAC, the Rapid Access Cloud. So we're going to talk about these four.

DAIR, first. DAIR is a federally funded Canadian public cloud. It provides a free test bed for researchers and small-to-medium enterprises who are in the pre-production phase. Users of DAIR get four cores, four gigs of RAM, four instances, and 200 gigabytes of object storage for free. If they want more resources than that, they can pay for them, and the fee is quite nominal. DAIR is physically located in two different parts of Canada, Alberta and Quebec, and the two sites communicate with each other through the Canadian research network I mentioned earlier. As for use cases, DAIR has been used for data analysis, mobile application development, game simulations, and some high-performance computing. Now for the OpenStack details of DAIR, starting with the pilot phase in 2012.
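DAIR balances each user's quota across its two regions: usage in one region is deducted from what's available in the other. Here's a minimal sketch of that accounting logic, using the free-tier numbers above; the function and variable names are made up for illustration, since the real balancing is done by custom scripts and cron event listeners against the OpenStack APIs.

```python
# Illustrative sketch only: DAIR's real balancing is done by custom
# scripts and cron/event listeners against the OpenStack quota APIs.
# Names and structure here are assumptions, not Cybera's actual code.

DEFAULT_ALLOWANCE = {"cores": 4, "ram_mb": 4096, "instances": 4}

def quota_for_region(allowance, usage_in_other_region):
    """Quota to apply in one region after subtracting what the user
    already consumes in the other region, floored at zero."""
    return {
        resource: max(0, limit - usage_in_other_region.get(resource, 0))
        for resource, limit in allowance.items()
    }

# A user running one 2-core, 2 GB instance in Quebec has that much
# less quota left in Alberta:
alberta = quota_for_region(DEFAULT_ALLOWANCE,
                           {"cores": 2, "ram_mb": 2048, "instances": 1})
# alberta == {"cores": 2, "ram_mb": 2048, "instances": 3}
```

In practice this kind of computation would run from the cron job or event listener, with the resulting limits pushed to each region through Nova's quota API.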
We ran OpenStack Cactus during the pilot. DAIR graduated from the pilot in 2013, when we installed Grizzly, which is where it's at currently, and we have plans to upgrade to Havana and then make a quick jump to Icehouse later this year. Some internal details: we use Keystone with the templated catalog, Glance with the file backend, nova-network with VlanManager, Cinder backed by two NetApp appliances, and Swift is just a vanilla install of Swift.

There are two regions in DAIR. Back when we launched with Cactus they were actually called zones, but we now use regions. The users and images are shared between the regions using a single database, and we've created some custom scripts and cron daemons with event listeners to balance the quotas between the regions, so if someone launches an instance in Alberta, that's also taken off of their quota in Quebec.

Lessons learned during the pilot: don't be cheap on network equipment. We started out with some very cheap 10-gig switches, and those switches crashed a minimum of once a month, probably more like three times a month. We were using iSCSI for block storage, and every time the switches crashed we had a lot of data corruption issues. Since then we've learned to buy decent network equipment; in all of the clouds since, we've been using Arista 10-gig SFP+ switches, and they've been very good to us. Because of the data corruption issues we also had a lot of user issues, and we learned a lot of lessons from that, so when we went into production we ended up going with a NetApp for the central storage appliance. The NetApp also gave us centralized storage, so we were able to start live-migrating users, which helped us with the SLA that we support, which is 24/7; the DAIR cloud is under a 24/7 SLA.

Now, the next cloud is the Learning Management Cloud, and the address for that is there on the screen. The LMC is a three-year pilot with Alberta post-secondary institutions, so anything from colleges to universities. The goals of that project were to explore cloud-based shared computing resources, to explore infrastructure automation as an alternative to platform-as-a-service, and to find out how efficient infrastructure automation can make us. A quick overview of the LMC: it went into production in 2012 and has had 99.97% uptime.
There are four different institutions using it, with more coming on board. Inside the cloud there's a total of 270 servers in 18 different environments. There are two full-time employees supporting it, and another two full-time employees extending the system. The point about support is worth commenting on, because one of the goals was to continue to develop the system after the cloud went live, and that was a very different way of thinking for the universities, who had their systems in place and didn't want anything to change. This idea of continuously evolving, continuously updating the environment was very scary to them.

Use cases for the LMC: all four institutions currently on it are hosting their learning platform infrastructure, such as Moodle, and the supporting infrastructure, such as PostgreSQL, Varnish, and Logstash, is all hosted in that cloud as well. As I mentioned, there are 18 environments, so it allows them to have separate development, pre-production, and production environments.

Lessons learned: how to migrate to the cloud, and how to automate legacy systems. That goes back to how the institutions were pretty scared about having bare-metal servers, which never got touched and never really got updated, moved into a more continuously evolving environment, and it challenged their traditional policies and procedures.

One of the terms coined by the LMC team was "just-in-time complexity". The tools and automation the LMC team created, they created in such a way that they could be run as a whole or split into little atomic pieces. It's sort of like the Unix way of thinking, where you have one tool and it does one thing well. The tools the LMC team built were collections of this, wrapped around Chef and various Ruby scripts; they call it just-in-time complexity. So if they wanted to do an entire environment build, they could do it, but if they only needed to dig into one part, they could take the tooling apart and run just the piece they needed.

Prescriptive error messages are another big thing the LMC team put a lot of effort into. When they got errors, it wasn't just a stack trace; it gave a more human-readable error that offered pointers on what to do, where in our internal docs to look, and other things to try.

The last two points, taking a mature attitude towards errors and MTTR over MTBF: basically, the institutions that were part of this really didn't like it when we said "this server went down, we rebuilt it, everything was fine." They wanted no errors at all on the reports they would have to take upstairs. But over time they understood that errors are okay; what matters more is how fast you can recover from those errors. Traditional audit and compliance systems assume unchanging systems, which again goes with that point: they're difficult to reconcile with dynamic systems. Tension is healthy and keeps things in balance; compromise and trust are necessary, and trust is earned. The traditional way of doing things, with a monthly change window and so on, really didn't mesh well with the continuously evolving approach. But there were good points on both sides: with continuous evolution, if there's a change once a week or once a day, it might take on a bit of a cowboy nature. So the two sides, the LMC team and the universities, compromised and learned from each other. The constantly evolving side learned to take it easy once in a while, while the unchanging side learned that it's okay to change.

And with that, Michael is going to take over for the third and fourth clouds, starting with the VCL.

Hi. So the VCL was a pilot done by the University of Alberta and ourselves over two and a half years. The pilot itself is just wrapping up, with the University of Alberta taking it over. I'm not sure how many of you are familiar with the VCL, the Virtual Computing Lab, but the idea is that it boots up a bunch of instances in place of an actual computer lab, and students can then go use those. It was originally running on ESXi, and then we ported it to work with OpenStack. Someone had written a VCL module to integrate with OpenStack; it wasn't perfect, so we did end up modifying it. One of our biggest issues is that we don't have a lot of IP addresses available to us, so one of the things we had to do was create a NAT patch so we could use eight IPs instead of the 160 or so that we estimated we were going to need for the lab. The other thing that was done was moving the module to OpenStack's native API. Additionally, VCL expects a certain fixed layout, whereas OpenStack provides a much more dynamic one; for example, when VCL is creating a new computer, it always expects the same IP address, and OpenStack doesn't really believe in that. So we had to create some patches to work around that, which has made VCL a lot more flexible. One of the things we're also looking at doing with it is integrating it with some of our other clouds.

Continuing on: it only has the one use case on the cloud, which made over-committing a little more interesting. It turns out memory wasn't so much the issue, and disk I/O wasn't so much the issue; we were actually more CPU-bound. Otherwise, the VCL use cases we were seeing were economics classes, statistics classes, education classes, and some renewable resource classes, a lot of them using either math programs or custom modules bolted onto Excel, that kind of idea.
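The idea behind that NAT patch, many private lab VMs sharing a handful of public addresses with each VM reachable on its own forwarded port, can be sketched as a simple deterministic mapping. This is an illustration of the concept only, not Cybera's actual patch (which lives in VCL's Perl code); the address pool and port scheme here are invented, using the 203.0.113.0/24 documentation range.

```python
# Hypothetical sketch of fitting ~160 lab VMs behind 8 public IPs.
# Not Cybera's actual patch; addresses are from a documentation range.

RDP_PORT = 3389      # each VM listens for Remote Desktop on its private IP
BASE_PORT = 40000    # first forwarded port on the public side

def public_endpoint(vm_index, public_ips):
    """Map VM number vm_index to the (public_ip, port) it is reachable on."""
    ip = public_ips[vm_index % len(public_ips)]
    port = BASE_PORT + vm_index // len(public_ips)
    return ip, port

pool = ["203.0.113.%d" % i for i in range(1, 9)]  # the 8 shared public IPs

ip, port = public_endpoint(42, pool)  # -> ("203.0.113.3", 40005)
```

A NAT box would then DNAT each (public IP, port) pair to the corresponding VM's private address on the RDP port, which is how roughly 160 VMs can fit behind eight IPv4 addresses.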
So there were 43 courses and 1,800 students, and on top of that we were creating all our golden images and making sure that licensing would keep working, that kind of idea. The actual OpenStack cloud behind the scenes was running Essex. We had Ubuntu on the hardware, and CentOS was actually running VCL itself. We had eight nodes, 64 cores, and 128 gigs of RAM. We did have SSDs in there, but again, going back to what I said earlier, it turns out disk I/O wasn't our major issue, which is what we thought it was. One thing to note, which I'll get into in a little bit, is that it would take about ten minutes or so for a Windows VM to come up, and our problem wasn't actually disk; we weren't disk-bound.

As for configuration management, we do like to play a lot with configuration management at Cybera. With VCL we started with Ansible, we tried moving to Chef, and at the moment it has moved back to Ansible.

Going back to the challenges I mentioned earlier: VCL expected private IPs, so we had to put in the NAT part. One of the interesting bits with VCL is that when it goes to clean things up, it'll start deleting nodes. We've had it happen twice where it deleted its own management node, and that goes back to those same expectations: VCL was expecting to be on a public IP where it could talk to every other node, but with everyone in one tenant inside OpenStack, it was actually allowed to delete itself. We ended up having to work around that.

Here's a quick graph going back to the actual IOPS. The big thing to note is that the top of the graph is only 8,000 IOPS; our RAID 0 of SSDs can provide 50,000 without blinking.
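That self-deletion workaround boils down to a protected-node check in the cleanup path: never pass infrastructure nodes to the delete call, even though, with everything in one tenant, the API would happily allow it. A toy sketch with hypothetical names (the real fix was made inside VCL itself):

```python
# Toy illustration of guarding cleanup against deleting infrastructure
# nodes that share a tenant with the lab VMs. Names are hypothetical;
# the actual workaround was made inside VCL's own code.

PROTECTED = {"vcl-management-node"}

def deletable(candidates, protected=PROTECTED):
    """Return only the instances that cleanup is allowed to remove."""
    return [name for name in candidates if name not in protected]

doomed = deletable(["lab-vm-07", "vcl-management-node", "lab-vm-12"])
# the management node is filtered out before any delete call is issued
```

Checking a protect-list before issuing deletes is cheap insurance whenever an automated system shares a tenant with the machines it manages.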
Our problem was not disk I/O like people thought. We had done a lot of work to try to improve the boot times on the VCL images, and what most people thought the problem was, it wasn't.

The biggest lessons we pulled out of our VCL work: double NAT can actually be surprisingly stable; golden image sprawl is always an issue; and license management is still there in the traditional way, it doesn't go away even with golden images. With some programs, if you're lucky, they'll use a license server, which moves the problem away; other times they have to be re-licensed every year, which means you've got to update your golden image. And heaven forbid you update one golden image but don't carry the update into another, and then use the one that wasn't updated. It can prove to be quite a bit of work.

Going forward, we're looking at integrating it with one of our other clouds, and we're modifying the module to work with Havana and eventually Icehouse, because it only works with Essex right now. We are talking to the VCL development team. Well, I'm not actually the one talking to them, so I'm not sure exactly where that's at, but we're hoping to provide them a test environment they can use as well. Another thing we're looking at is something called virtual memory streaming, in this case a product from Gridcentric. The idea behind virtual memory streaming is that we can boot up an instance in about three minutes instead of ten. We're also looking at better ways to remotely access the VMs, because we've just been using the standard Remote Desktop Protocol so far; we're looking at things like core, that kind of idea.

And then our last cloud, which is kind of our favorite, is the Rapid Access Cloud, which we just reimplemented two months ago. This one is in many ways very similar to the way we architected DAIR, although in a much newer fashion. It's a free public cloud with a lot of the same use cases you'll find with DAIR, but it's more aimed at researchers. Anyone in Alberta, whether they're a startup, an entrepreneur, or just someone wanting to play around with cloud resources, is able to go in and sign up. By default they're given up to eight cores, eight gigs of RAM, eight instances, and 500 gigs of block storage; object storage went online earlier this week. If they want more, they can email us and ask. But one of the things we're most excited about with this cloud is that every instance has full IPv6 access. Again, one of our biggest limiting factors is that we don't have a lot of IPv4 addresses, but we have a whole lot of IPv6 addresses.

The use cases we're actually seeing on there: we've got several classes from the University of Alberta using it for Hadoop; classroom workshops, both ones we've run ourselves and ones at the universities; and there have been a couple of hackathons by various incubators and local hacking groups that have used it quite a bit. And the fun one was an email we got yesterday from some of our previous users: their paper just got published in Science, to do with research on lemurs, figuring out what land should be preserved in order to protect the lemurs.

As for the actual details behind the scenes: we're currently running Havana. It's based in two regions, Edmonton and Calgary, and it's actually hosted at the universities. We have two active-active cloud controllers in each region, with each of the components (Nova, Cinder, Glance, etc.) running in its own LXC container on the controllers. Otherwise, Keystone again is very similar to DAIR; we're just running the templated catalog. Glance is currently using a file backend, and we just rsync the files between the two regions every half hour or so, although we're looking at using Swift instead, which will be a little easier. For Cinder we're using the GlusterFS driver, because of the budget on this one: we decided to use the hard drives inside each of the nodes and created one large shared storage pool with Gluster. And due to the high-availability nature of the cloud, we use a clustered version of RabbitMQ, and for SQL we use Percona with Galera.

The other big thing: we're using nova-network on this one. We couldn't get IPv6 working with Neutron, so we're sticking with nova-network with flat DHCP. It's not very large, and again, every instance has a public IPv6 address up and available. The IPv6 addresses aren't actually configured through OpenStack; that's done through an upstream router advertisement.

And that's about it for what our four clouds are. I'm hoping you were able to get some details on the architecture. Otherwise, if you have any questions, the mics are available. Thank you.

Q: [inaudible] ...who the cheap switch vendor was? Sorry, I was wondering if you guys would be willing to share who the buggy switch vendor was.

A: It's four letters long. Do you want to say?
A: Well, it was Dell's earlier switches, which are no longer available, and we have been in talks with Dell about the past issues we've had; Dell was aware of the issues we were having with the switches. As for the actual models and what the switches were based on, I forget which company Dell purchased the switches from, but they're just no longer available. So nothing against Dell or anything like that; all of the clouds you just saw are actually running on Dell hardware. It was just those older Dell switches. We're loving our Arista switches.

Q: Yes, so kudos on the IPv6. The other thing I wanted to ask is about that last one, the Rapid Access Cloud: you said you were basically doing a sync for Glance from one region to the other. Who are your tenants for that, exactly? I mean, how are you guaranteeing that they're not accidentally spinning up off of an older Glance image? Do you guys care?

A: Yeah, well, if you could rewind a couple slides.
Q: Maybe I misunderstood how you do it there.

A: With the file backend in our case, images are uploaded or changed infrequently; we might see about one a day, so we aren't really worried about that. There is a chance, but users are made aware of it ahead of time. For example, if they take a snapshot of an instance in Calgary, we tell them in the documentation that it won't be available in the Edmonton region for half an hour, so they're expected to wait some time until it's actually available. One of the reasons we want to move to Swift is that it would remove that window, but in practice we haven't seen any issues.

Q: Yes, in the first example you mentioned you switched to NetApp, I guess shared storage or enterprise storage, and you mentioned that there were reliability improvements as well as availability, which enabled live migration in your follow-on project. Have you used enterprise storage anywhere else, or do you pretty much use internal storage?

A: No, we haven't actually used enterprise-grade storage like NetApp in our clouds since then. One reason is the budgets we're under with these different clouds; NetApp is sometimes just too expensive for that. The other reason is that when we made the decision to use NetApp, projects like Gluster and Ceph were a bit too early for us to trust in production, whereas nowadays, for example in our Rapid Access Cloud, as you can see, we are using Gluster for different parts. If our SLA required a live-migration type of scenario, we would have easily used Gluster to provide the shared storage between instances. So it's a mixture of the cost of the NetApp, the SLA, and the open-source technologies available now that explains why we haven't used NetApp since.

Q: Just a little bit more about the NetApp: do you guys back that up right now?
Q: Do you have dual NetApps or anything like that?

A: Right. Each region actually has two NetApp controllers, so everything's highly available: if one controller goes down, the other one picks up the disks and the shelves of disks, things like that.

Q: And then going back to the VCL cloud: what was the profile of that? In terms of, what are you guys running, is it KVM?

A: Right. So it's OpenStack Essex at the moment, KVM for the hypervisor, and the hardware is Dell C6220s.

Q: Yeah, so with the overcommit ratio you talked about, or being CPU-bound: did you find it was the overcommit ratio, and what did you do? Did you bring it down to one-to-one, is it two-to-one?

A: That's a good question. Unfortunately, neither of us is on the VCL project, so we're not too sure of implementation details like that.

Q: Okay, all right, thanks.

A: The good news is, in practice it wasn't that bad; it was usually when we were bringing up instances. That said, I'm not sure. The guy who was working on it was actually here last Friday giving a talk just about that, and he was discussing these exact issues, so I'll give you my card and put you in touch.

Q: Yeah, I'd love the link to that. I'd love to pick his brain, because we have a very similar problem moving some workloads. We purchased a company, and we've been migrating their workloads, Windows workloads, into OpenStack and KVM, and it's just been a real pain. We're figuring out that part of it is the overcommit ratio, but we also have issues with the images coming up, and just general issues with performance. So we'd love to hear more. Thank you.

A: Thank you. Go ahead.

Q: A question about that.
I think it applies to the DAIR instance, on nova-network: are you running one instance of it, or one on each compute host?

A: For DAIR, nova-network is single-host; for the Rapid Access Cloud it's multi-host.

Q: And is that just legacy, or is it because some of them are doing more traffic externally?

A: With DAIR it was a bit of legacy; it was what we knew at the time when we implemented DAIR, about a year and a half ago. With the Rapid Access Cloud we were experimenting more. It was our first implementation of multi-host, but we had the ability to give it a shot and see if it worked. Also, public IP addresses are a big constraint on it: each compute node that we put in multi-host consumes one public IP, so that's also a huge factor for us.

Q: And what was the size of that? How many compute hosts are there on that one?

A: Sure. For the Rapid Access Cloud we have a total of 14 or 15 right now, and DAIR is at 36.

Q: Actually, the other gentleman brought up a very good point on backup. Because you're replicating between two sites, you're not doing any backup on those clouds?

A: In this case, you're correct, for both of them. But both DAIR and the Rapid Access Cloud are aimed very much at non-production workloads and such, and it's made very explicit ahead of time that we aren't making backups. It's not so much that we can't as that it's just not feasible for what we're trying to provide. So we give users the ability to snapshot, to rsync their data off, and basically handle the backups themselves.

Any other questions? Thank you guys very much. Thank you.