I think we can start. Hi all. My name is Luis Periquit and I work for Ocado Technology. I don't know if any of you have ever heard of Ocado Technology, or Ocado Retail for that matter. If you have been in the UK, or seen street scenes from the UK, you may have spotted our colorful vans whizzing around. But in reality Ocado is much more than just a retailer. We are, first and foremost, a technology company that does retail.

These are some numbers that show the scale of doing retail, and the scale of what we actually do in retail. If you didn't know Ocado Retail at all and you spent a day at Ocado Technology, you could compare us to any other technology-based company and completely forget that, behind the scenes, the objective is to do retail. These numbers show a very small percentage of the actual work that we do; this is just the tip of the iceberg.

It is very hard to do retail. Just think of your normal weekly or monthly shop. When you go to the supermarket you don't take one item, you take 50 or 60 items, and those items are very different from one another. You want your ice cream to be kept frozen, but you don't want to freeze your yogurt. And don't tell me you put your cans of coke on top of your bananas; that doesn't end nicely. And everybody likes to scramble their eggs at home, not in the shop. So doing retail is very hard: those items are very different. What about shelf life? Your bananas and your eggs will have a shelf life of a week, two weeks, a month maybe; fruit goes bad within a month. And when you want your items delivered, you need them delivered fresh and freshly picked, like your vegetables. There are items with a very long shelf life that don't care if they are delivered today, tomorrow or next week, but that's not the norm. And it is a cutthroat business with very low margins. Maybe not at this venue, but I bet if you step out of your office during the day you will find one, two, five retailers, all within a short distance of where you're working.

So to succeed online, Ocado started by thinking about the technology first and then working towards the retail part. We started building these very big warehouses. Trust me, they are massive. We have one warehouse where, when people need to go from one end to the other, they drive. They need to be massive to ensure scale, and if they are massive they cost loads of money, so the business wants a return on that investment. And when you are talking about very slim margins, making a return on that investment is really, really hard. That is the balancing act Ocado found itself in. With a big enough retailer and a big enough balancing act, it is possible to make a profit, and Ocado is one of the very few that have been able to do that with these massive warehouses.

Now imagine you get to a place where you get that balancing act just right, where you prove it's possible to make a profit online, and you prove you can deliver a first-class service that your customers love. Every so often we have a running joke inside Ocado, which is finding media references to Ocado, and good quality delivery, at least in the UK, is getting to be synonymous with Ocado. We got to that place, and then the business sat down and asked: why don't we deliver the same solution to the biggest retailers in the world? Anyone, anywhere in the world.
And so we created what we call the Ocado Smart Platform, OSP, to try and power the world's largest retailers. That is a big challenge, and we found ourselves with a big challenge on our hands. It is one thing to try and deliver the best solution in the world; another thing entirely is all of the steps we need to take to deliver it. So the business set the tech teams a challenge: be able to deploy quickly and efficiently. We created a challenge which we called, internally, "zero to cloud". It's basically: how can we deploy this amazing, awesome solution anywhere in the world, starting from an empty shell, an empty data center with everything powered off, fully automated and without any manual intervention? We haven't yet figured out how to automate building the warehouse itself, i.e. the walls still require someone to put them there. You can hire people to do one of these warehouses manually. You can hire people to do two, maybe three on a good day. Try doing tens or hundreds; it just doesn't scale.

Actually, I think this is the holy grail of systems administration: how can we start with something that's completely off, hit the power button on the first server, and then get everything else to build from it? All of the other servers, all of the applications; orchestrate, deploy, and make sure it all works. How can we start with an empty data center and build a fully working compute cloud solution from it? That was our challenge.

So we started building the next generation solution, which is, from the ground up, fully modular. Instead of requiring a massive warehouse from day one, we start with the smallest modular solution we can and add another module as we need to grow and expand. It grows with demand. That requires a microservice type of approach; actually, this is exactly what microservices are: start with the smallest unit, add units as you need them, and make sure they are as independent and replaceable as possible. And it necessarily needs to be scalable, because what we do today will matter for us tomorrow. These massive warehouses need to scale, because everyone wants more customers, and it's a good thing when a company is growing its customer base.

So we got this design and we started building a warehouse in the southwest of London, 30 or 40 miles outside London. We started building this solution, started deploying it, and learned as much as we could from it. We proved that our new architecture actually worked and would deliver a good solution. We ran very fast, iterative deployment cycles (again: microservices, continuous integration and continuous deployment of the solution) to make sure it worked, with early lifecycle warnings that everything was working, and we kept building until we got to a fully functional warehouse running the solution.

I'm going to show you a video of that warehouse. Sorry, it has no sound, but it's just for you to see what we are really talking about. Honestly, I've been there on the sidelines watching them move, and I still find this video really impressive; you can't feel the scale of the moving parts. All of these are controlled by applications that we develop ourselves, running on OpenStack. If you want to see it afterwards, it's on the Ocado Technology YouTube channel. Feel free to share it; it's public, and it's really impressive.
That was a drone view of the warehouse, in real time. That video is real time, no speeding up, and those things are fast. Trust me, I've been next to them.

So we built it, we proved it worked, and we rested. Well, not really. We were given the challenge: prove it. Building the first one, fine, anyone can do that with enough resources. Now go and build another one. This time put two retailers in, one is Ocado Retail and the other is Morrisons, and build it in the southeast of London. And make it bigger. This second warehouse is so big that I have stood on one side of the grid and you can't see the other end. It's amazing, really. And so we did; that was the challenge.

And then: let's build another one. Let's prove this is not a one-off. Let's prove we can do it properly, fully automated. As we were finishing building it, our first international retailer came on board, and we were really happy; I remember some woo-hoos. Someone outside Ocado believed what we had always believed from the start, that we could make this work for one of the world's largest retailers. So we started building one of those warehouses for Groupe Casino in the Paris area in France, for those who don't know where Paris is. As we were about to start building it, the business came to us and said, oh, by the way, build another one for ICA in Stockholm, Sweden. And we said, OK, we can do that. Almost the next day they came back to us and said, well, you know, Sobeys in Canada, they want to build one in Toronto. And we said, OK, what could ever go wrong? We had never done two at the same time, let alone three. And then they told us: let's build 20 all over and around the US. And we started running like hell, shaking, screaming. No, not really. That was the whole challenge from the beginning: how can we build these, automate these, run these fully automated without manual intervention? It doesn't matter if we have one, 10, or 100. I guess if we had 100 we would start hitting problems with our suppliers, because they couldn't provide the servers or the bricks and mortar and all of those things. But the challenge was doing it over and over and over again without any manual intervention, and that's exactly what we were doing. Even on our second warehouse, in Erith (maybe not on the first), we made sure everything was actually fully automated. Why not on the first? Because some things didn't work on the first, as things never work well on the first. But we kept repeating, doing the same thing, and doing it properly, making sure everything works along the way.

And we got a big advantage from that. From the ground up, we designed for availability. We know a failure is only a matter of time. We will have some. It doesn't matter how much we plan, it doesn't matter how much redundancy we put in the system, something will fail. A server will fail, a disk will fail. And instead of waiting and reacting to a failure, we design for availability: we assume that even if something fails, it doesn't matter, things will continue running. To do that, we give, from the ground up, full visibility of the availability zones. Applications know which availability zone they are running in, and know what they should be protecting themselves against. We have implemented things like affinity and anti-affinity to make sure things continue running and working.
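To make the affinity and anti-affinity point a bit more concrete, here is a rough sketch using the plain OpenStack CLI. The group, image, flavor, network and zone names are all made up, and this is an illustration of the mechanism rather than Ocado's actual tooling: a server group with an anti-affinity policy tells the Nova scheduler to keep its members on different hypervisors, and combining that with explicit availability zones keeps the copies in different zones as well.

```bash
# Anti-affinity group: members must be scheduled onto different compute hosts.
openstack server group create --policy anti-affinity orders-db

# Boot two replicas into the group, pinned to different availability zones.
openstack server create --image ubuntu-20.04 --flavor m1.large \
  --network app-net --availability-zone az1 \
  --hint group=<ORDERS-DB-GROUP-UUID> orders-db-1

openstack server create --image ubuntu-20.04 --flavor m1.large \
  --network app-net --availability-zone az2 \
  --hint group=<ORDERS-DB-GROUP-UUID> orders-db-2
```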
That has given us the ability to do things like full stack upgrades: upgrading the operating system of the hosts, upgrading everything. Because when you get to a point where everything is actually separate, where everything knows what availability to expect, you can take one availability zone out from under the applications and they will continue running.

To prove everything continues working, we keep deploying. We deploy often and we deploy small changes. If they work, and they usually do, awesome. If they don't, we can very quickly identify what broke and roll back. And for that we need to test often, so we keep testing. We have enabled in production, and when I tell this to people they call me brave, something called Chaos Monkey. For those of you who don't know what Chaos Monkey is, it is one of the tools in Netflix's Simian Army. Chaos Monkey goes into our production environment and shuts down an application or a database or a server at random, during office hours, whilst our warehouses are picking, packing and delivering customers' orders. I can't remember the last time we had any problems with Chaos Monkey. Actually, we did have a problem with it the other day: it stopped working. But other than that, no one is any the wiser. When we started doing this on our test systems we found loads of issues, which we fixed before they made it to production. When we stopped finding issues in our development and test environments, we started doing it in production, and I think it has gone well from the beginning, which is amazing because I was expecting some sort of failure there.

And we even do full DR tests. What is a full DR test? A full disaster recovery test. We go into one of our server rooms, where we have a big red button by the UPSs, and we hit it. That shuts down the whole room. If you've never seen a room shutting down, it's interesting. It just cuts the power to the servers in that room, in that availability zone, and everything keeps working. We do that during normal office hours, because real failures will happen at any time, and they tend to happen out of hours. By doing this, we know that when we do have a failure, because they happen, because an air con fails, because a plane hits the twin towers, because a butterfly flaps its wings in New York, everything will keep working. And when we have a failure, we don't fret about it. We planned for it. We designed assuming that we will always have availability.
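For a sense of what a Chaos-Monkey-style run boils down to on an OpenStack cloud, here is a deliberately minimal sketch using the OpenStack CLI. It is only an illustration of the idea, not Ocado's actual tooling (the real thing is Netflix's Simian Army), and the cloud name is made up.

```bash
# Illustration only: pick one random ACTIVE instance in the project and shut it
# down, during office hours, then see whether anyone even notices.
export OS_CLOUD=warehouse-prod   # hypothetical clouds.yaml entry
victim=$(openstack server list --status ACTIVE -f value -c ID | shuf -n 1)
echo "Chaos Monkey victim: ${victim}"
openstack server stop "${victim}"
```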
So why do we use OpenStack? Nowadays everyone is moving to cloud-first, cloud-only I would say, and Ocado is no exception: we are cloud-only. Having the same set of tools, the same solution, everywhere just makes that simple. We use a very microservice-oriented design, and things like OpenStack very clearly allow us to have a microservice design where we treat things like cattle, not pets. We treat our applications as cattle: we shoot them in the head when we're not happy with them, we even enable Chaos Monkey on them, and if they go away, they go away. Not as pets. I'm old enough to remember when we treated all of our machines with tender loving care, sometimes even more than our own children; we loved them, cared for them, made sure they were always running, and if one had a problem, well, it was our pet and we did a lot of work to make sure it kept working. OpenStack also helps by providing separation of roles, making teams independent. Instead of having team A waiting for team B to deliver whatever, we give them the tools for self-service and make sure team A has everything they need to keep working.

I don't know if you remember that video of those bots moving, but latency for us is very important. When you have a real-time warehouse with real-time moving parts, latency matters, because a tenth of a second means one of those robots is here instead of there, and bad things happen when things are not where you expect them to be in the real world, as you may have guessed. And complying with best practices is just making sure it is a cloud-based solution: if our teams are developing for Ocado's internal OpenStack cloud, it's the same thing, it feels the same, it works the same way as developing for any other big hyperscaler or public cloud out there. We even go to the effort of making sure all of the development teams have very familiar interfaces. Whether their applications are running on the public cloud or on top of our infrastructure, it feels, tastes and works the same way; the interfaces are familiar and everything works in the same way.

How do we use OpenStack? We use OpenStack as our internal cloud platform, alongside an army of tools. We have OpenStack locally at our warehouses because of latency, as I said, and everything is designed and thought out for the lowest latency; we are very concerned about latency. We are using Calico as our networking plugin, and after doing some tests, after all of the tuning work we did on the servers, I was able to get the latency for a packet from instance to instance across different availability zones, so across different hosts, racks, even rooms, down to 40 to 50 microseconds, which at 10 gig is basically line rate; we can get slightly lower, but that's essentially it. We then developed a layer with Kubernetes on top of it that ensures all of the applications are orchestrated, deployed and running, and that everything is working. That layer actually provides the abstraction between the OpenStack cloud and whatever public cloud we use behind the scenes. All of the levels of the stack are designed to scale out: if we need more, we add more resources and performance increases with them. That's why we use Ceph as our storage layer.

As I said, our ethos is full automation: making sure everything is as automated, as repeatable, and as done-the-same-way as possible. As you may remember, we are talking about 25 to 30 warehouses in the short term, and that means 25 to 30 copies of every application to run, and that's if we only had a single copy of each application per site, which we don't. So you can see the multipliers there; making sure everything is fully automated and fully orchestrated becomes that much more important. We got to the level where we integrate all of the hardware management. Remember when I said we do the holy grail of systems administration? On a new site, we power on the first server, and every single server after that is built from that first server. That first server logs in via IPMI to the other ones, makes sure they have the correct BIOS settings and the correct config, boots them, installs the operating system, and orchestrates the whole build.
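To give a feel for the lowest level of that "first server builds every other server" flow, the per-node step is essentially remote BMC control. Below is a minimal sketch with plain ipmitool; the variables are placeholders, and the real automation obviously wraps much more around this (BIOS settings, inventory, PXE and preseed serving, and configuration management on first boot).

```bash
# Illustrative per-node bootstrap: talk to the new server's BMC, force a one-off
# PXE boot, and power it on. The deployment host then serves the OS installer,
# and configuration management takes over from the first boot onwards.
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis bootdev pxe
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power on
```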
We have a script that we start on one of those warehouses. It takes three or four days to run, and at the end it outputs what was built and what wasn't, because every so often people forget to connect a power cable to a server, or don't put a server where they should, and my script can't handle that. We still can't physically move servers into the server rooms, unfortunately. But it does everything from the ground up, from the configuration of the server all the way to the application, and everything is integrated. For me to add a server into the solution because we need more compute, it's: rack the server, cable the server, and wait an hour or two for it to be fully built and working in the cloud. Other than the physical installation itself, no manual intervention. And this is a thing I keep repeating: we automate everything that we can.

How do we deploy OpenStack? We have several environments, and we try to make them as realistic as possible. Our environments vary only in the number of compute and Ceph nodes; we do tests with different numbers of storage and compute nodes, but otherwise they are exactly the same. We try to use the same hardware, built in the same way, and make sure everything is actually the same. As we integrate with the lower levels of the hardware, we have found that with similar but slightly different hardware, sometimes the script works and sometimes it doesn't. And the more you want to automate, the more having a realistic environment helps.

We start with our lab environment. Well, not really: we start with test VMs on our laptops and things like that, but I don't count those, because I have three or four of them and I keep testing things out. So we start with our lab environment. The lab is an environment where me and my team are the only ones who have access. It is an environment where we do "what if" tests, and it's the only environment where we allow some level of manual tinkering. What will happen if I change that variable? Usually it means that thing crashes and burns, but you have to learn somewhere. What will happen when we do this? How do these pieces integrate? This is where we start testing all of our automated deployment scripts.

We promote those changes to SIT, System Integration and Testing. In this environment we don't allow manual interventions, and I have not-so-happy conversations with the team if I see someone doing that. This is where we prove that all of the automation we just did really works, and it usually does. If it doesn't, we go back to the lab, figure it out, promote the changes and do SIT again.

The regression environment. I told a small lie: the regression environment is an environment we keep building and rebuilding to make sure everything keeps working. There is some level of manual tinkering here, which is getting servers back into a pristine, out-of-the-box state so we can prove that all of the systems and all of the hardware management actually work. So yes, there is something manual there, but it's just restoring default settings to make sure everything will keep working. We keep building and rebuilding this environment to make sure everything works. If we're testing an upgrade, we install everything on the old version, run a full suite of tests against it, run the code to install the new version, and hopefully it continues working. If it doesn't, we go back to the lab, figure out what happened, promote to SIT and get back here.
Again, this is about getting feedback as early as possible in the process, to make sure everything continues working. Then our pre-production. Pre-production for us is like any other production environment: it hosts the development systems for the developers' tests, so for them it's development, but for us it's production. These are the systems we deploy to first, when we are absolutely convinced everything will work in production. And finally, all of our production environments. These are the ones running our warehouses. Hopefully no bugs or issues ever find their way here, or if they do, they were already present in all of the other environments above.

Current challenges. Something we haven't quite been able to crack yet is version upgrades. The problem here is not OpenStack; it's our drive to make sure everything is as automated as possible. For me, the ideal scenario for version upgrades is what we do with Ceph nowadays. A Ceph upgrade for us is one git commit away: after testing everything, each of the last four or five Ceph upgrades was as much as a git commit and a git push. I keep forgetting that last one, and then things don't work. That's where we want to get to with OpenStack, but with our drive for automation and orchestration, and making sure everything is exactly the same, it gets really hard.

This is probably one of my pet peeves: either I have way too many logs and I can't find something, or I have no logs and I can't find something. Either way, when something goes wrong or doesn't work as it should, I can't find the logs for it. It doesn't help that OpenStack is as complex as it is; it's running on several hosts with very complex requests. Something that went wrong in Heat was actually Neutron not working; something that went wrong in Cinder was actually a problem with Keystone. Those are two problems we had recently. The logs should point you somewhere, but they don't, the errors are not always as helpful as they should be, and we have yet to find the correct level of logging for our solution.

I don't like the default network stack. I think the default network stack as it is in OpenStack, and it works pretty much the same way in all cloud providers, OpenStack or not, public or not, is a sysadmin's or a developer's solution to a network problem, done by someone who doesn't really know how networking works. I don't think it is very good. The first thing we did with our network was rip it all out and build a fully layer 3 attached network using BGP and Calico. A caveat to this: I don't know Octavia, and maybe it fixes this, but I don't think calling the default load balancing solution we had "load balancers", or "highly available", is the correct way to describe it. It works based on some layer 2 shenanigans behind the scenes that do work most of the time, but we completely ripped it out and built our own solution. As I said, we have a fully layer 3 attached network. That means we can do things like anycast VIPs without any external load balancers: we let the network choose where to send things, and the network delivers the traffic for us and runs at wire rate.
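To illustrate the anycast VIP idea: several hosts bind the same /32 service address and each advertises it over BGP, so the layer 3 fabric routes every client to the nearest host announcing it, and a failed host simply stops announcing. The fragment below is a purely illustrative sketch with plain BIRD and made-up addresses and AS numbers; in a Calico deployment the BGP configuration is generated for you, so don't read this as the actual setup.

```
# bird.conf fragment (illustration only).
# On each host serving the VIP, the address is also bound to loopback:
#   ip addr add 10.200.0.10/32 dev lo

protocol static anycast_vips {
  route 10.200.0.10/32 via "lo";     # hypothetical anycast service address
}

protocol bgp tor {
  local as 65101;                    # made-up private AS numbers
  neighbor 10.0.0.1 as 65100;        # top-of-rack switch
  import none;
  export where proto = "anycast_vips";
}
```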
And whenever you build something, the first thing it gets compared to is the public cloud, compared with whatever is available out there. Whenever I deliver OpenStack and I prove that it works much better, is way cheaper, and has much better performance, people keep comparing it with the bells and whistles of the public cloud. OpenStack is improving in leaps and bounds almost daily, and it has loads of projects doing very interesting things, but even then it's very hard to compete with the public cloud. I'm only doing private cloud with OpenStack, not public cloud, but I can probably feel the pain of the people who are actually using OpenStack to provide a public cloud, because we as humans, the first thing we do is compare our solution with everyone else's.

That concludes my presentation. Thank you. Any questions for me?

So, how do we replicate our data between OpenStack and the public cloud? Look, there is data that we want to replicate and data that we don't. It is part of our data governance to copy, prepare and ship the data that we actually need there, i.e. we have a very targeted approach. Data is a very hard, tricky problem to solve. There's loads of it; I don't know about you, but at Ocado there certainly is loads of data. So having data governance that looks at exactly which data interests us, and shipping only what you really want, is the important part. It's knowing the organization, knowing what matters to that organization, and how you can ship that data. Questions?

The robots are connected wirelessly, yes, because having cables all around wouldn't work; the first day's job would be to untangle all the cables. We have developed a 4G-based wireless protocol called RCOM. There is loads of information about RCOM out there; it's 4G-based and it's latency-oriented, because that's what suits our needs. No, no, because of density.

Does DST affect our robots? When we designed it, we thought out how to avoid being affected, so that was planned from day one. It's making sure everything works, thinking about availability and how things will work when you have the same hour twice in the day. So the robots continue working through DST changes, back and forth.

We're using Puppet, first thing. The question was: how do we orchestrate stuff? We have Puppet, and fortunately or unfortunately we started running Ceph back on the Argonaut release, when there was no Puppet module available for it. We got a third-baked, or quarter-baked, or 20%-baked Puppet module from upstream which we have been rewriting and redoing internally, so we've been using our own Puppet module, not upstream's.

Yes, Ceph. As I said, we do have an advantage there because we started really early with Ceph. We have played with very different hardware, and we aim to do that. The best lesson I can give is, again, going back to pets versus cattle: in Ceph, looking at performance and looking at latency, we make sure everything is as close as possible. We have lots of policies, for example around memory management, to try to stop disk-based latency from affecting you. Having said that, we have worked a lot on Ceph to make sure performance is as consistent as possible, it behaves as consistently as possible, and it doesn't do anything it shouldn't.

No other questions? Well, thank you all for listening. I hope it was good.