Well, good morning everybody, I hope you can all hear me well this morning. I would like to tell the story of how Picnic, who has been cloud native from the start, decided to go on premise, and ultimately also decided to go back again.

First a short introduction about myself, so you know who I am and how I was involved with all of this. I have now worked for about three years at Picnic, where I initially started as the founder of the Python platform team, launching Python as one of the key languages to be used within the company. But quite quickly I got involved with the infrastructure team at Picnic and took up the role of product owner, which I have been focusing on for the last two years or so. I have been in the field a little bit longer; this was not my first organization. Until then I was primarily involved with the ad tech industry, where there was quite a bit of infrastructure involved in running real-time bidding platforms for advertisements. Not so cool; now my mother finally knows what kind of work I am doing, which before she had no clue of.

So what does Picnic really do? Ultimately we are a grocery delivery service. Back then we were one of the first; now of course there are many competitors on the market. One of the things we do differently than especially the flash delivery services that you see nowadays is how we deliver: we position ourselves as a milkman 2.0. The milkman in the Netherlands was this friendly face that would come to your door on a regular basis to deliver the groceries you had ordered at an earlier stage, and we want to replicate that kind of experience, where it is friendly faces that you know that come to your doorstep, but now you do not have to order with him; you can do it from the comfort of your mobile phone. And that is quite nice. In order to do this, the way that we differentiate ourselves from these flash delivery services is by actually fully controlling the supply chain,
which allows us to deliver the groceries without delivery fees, which is quite nice. We can actually offer a wider variety of products because we have a bigger supply chain. It has a bit more latency, because we do not do within-the-next-30-minutes deliveries, but in return we can give you groceries at competitive prices compared to regular grocery stores, with a big inventory of products to select from. So that is a bit of what we do. To facilitate it, we control pretty much the entirety of the supply chain, except of course for the products that are still being produced somewhere and then shipped to us.

Infrastructure at Picnic was cloud native from the start, so everything is running in AWS, with EKS at its core. We try to make use of most of what AWS supplies: our ingress controller, for example, is the AWS ALB. And we try to keep quite a consolidated stack of technologies, which allows us to specialize within those technologies and maximize the benefits we get out of them. As we grow and new teams or new people join the company, they can easily pick up the learnings we gathered on all of those technologies for the new products they are starting to launch. With this consolidated stack, which we manage via Terraform with Spacelift to orchestrate all of it, we enable teams with nicely pre-packaged solutions, so that they can focus on their deployments running in Kubernetes, which we expect them to maintain. So infrastructure at Picnic is about the tooling around the deployments, and the development teams are actual DevOps teams that do both the building of the software and the operation of the software.

So what do we do on a daily basis? A big part of the supply chain that makes up our offering to our customers
is actually taking the products that are shipped to us from the producers and turning them into something we can deliver to our customers. I am not sure how familiar you are with our proposition, but the small EPVs that we have, the electric vehicles that deliver the groceries, are quite iconic, and you see them everywhere across the Netherlands now. Those are fully tailored to the specific service we supply, and turning all of those groceries, these big pallets of chunky stuff, into something that we can comfortably deliver to somebody's doorstep is a lot of work. You have to imagine that these different products all have different requirements for how to treat them. For example, we have frozen goods, which need to be kept below zero; we have chilled products, which need to be kept below seven degrees; and of course we have regular products. But even then you have stuff that can easily break: a bag of chips cannot go under a bottle of coke in a tote that you are shipping out. That adds a lot of complexity to this process, which we initially set out to do fully manually. So actually turning these pallets into the things we deliver involves a lot of manual labor, and exactly that was something we wanted to stop doing.

This is exactly the bottleneck that comes up in an organization like this: it is really hard to scale up. You have to imagine that the process we implemented is effectively a huge, industrial-sized supermarket that is fully tailored to optimize picking routes for the people who walk through it. That is great, we spent a lot of time on it and could optimize quite a bit, but it also has its limitations. Doing this work is, as you can imagine, labor-intensive. It is quite heavy. There is a risk of things hitting each other,
so it is also not fully safe. It is not the experience we want for our employees. We want something better, and not only for our employees: we also simply hit the limits of what we can do in one physical location. If we want to deliver more groceries with the same amount of people, we need to have something else in place.

For that, a couple of years ago, we launched the "welcome to the future" solution of Picnic to this whole problem, which was a fully automated fulfillment center. That process where you have people walking around this big supermarket-like environment has been replaced with a fully automated system comprising about 14 kilometers of conveyor belt. Those conveyor belts all need to be controlled: we need to control all of this such that we take the right products out of storage and route them to the right people on the floor, who take a product out of one storage tote and put it into another, in such a way that ultimately we again have the groceries prepared that we can ship out. This operation is massive. It was the largest site we had launched back then. We now have some larger ones, but it was definitely the most complex thing that we had tried to pull off.

To control such an operation, there is a lot of software involved in actually making it happen. In essence, there are three levels on which you need to operate such a warehouse. The first one is that you need to be aware of what orders are coming in, which orders need to be fulfilled by what moment in time, and to track that those things are actually starting to happen, as we need to make sure that the products are actually in-house in order to fulfill them, and so on. That is the high-level view.
That is what we call warehouse management. Then there is the level of warehouse control, which is more about ensuring that the right products are put in the right bags, which are destined for the right people, such that an order is actually fulfilled. And the last level, especially in this kind of automated warehouse, is something we call transport systems. This is the software that directly integrates with all of the physical infrastructure running on the site: the conveyor belts, the many scanners we have, the actuators, junctions that we can set in one direction or the other; a lot of physical stuff that needs to be maintained.

Typically, if you go down such a track, you can get these kinds of solutions off the shelf. We have a vendor that is actually building all of these conveyor belts for us. But Picnic from the start has been pretty focused on building software itself which you would normally get off the shelf.
This has been quite a successful method for Picnic and has allowed us to optimize a lot of processes in the supply chain. We believed we should do the same for warehouse control; warehouse management we already did ourselves for our manual FCs. Warehouse control is something you would typically get off the shelf, but this especially is the place where you can do a lot of optimization and improvements which you would otherwise have to wait for a third-party vendor to implement.

So with this kind of model, the question came to us as an infrastructure team: how can we facilitate all of this software, and where should we run it? We started to think about the different ways we would want to do this, and effectively we were able to identify three distinct types of workloads. The first one is the one that I think we all here love and enjoy: the stateless containerized deployments, your typical Kubernetes deployment that you can easily scale up and down. In this case that is especially the Picnic software, the things we always build ourselves, because that is what we do from the start, and then of course some of the open-source solutions we run alongside to facilitate all of that. The second one is the stateful, VM-based systems.
I will come back to why this is VM-based, but in essence it is the data stores that we would normally get as a SaaS solution and now somehow needed to run close by, because they had to be on premise. This would be Postgres, RabbitMQ: your typical data store stack. And then the last one, which is definitely a pain for such an operation, especially since this automated warehouse was destined to support about a quarter to half of the primary market we cater for: operational quality is essential, and having systems that we know are not highly available means that anything you do with such a deployment effectively means outages, and that is a huge risk for our entire operation.

In addition, a couple of operational aspects were identified that we needed to be able to achieve. The operational hours are effectively 22/7, all days except for King's Day; we need to have at least one celebration. Then there is the availability of four nines that we need to achieve, because it is such a critical part of our entire catering to the Netherlands. And we only have a small team to pull this off. Combine this with the key aspect of this whole solution: this was a massively important milestone for Picnic to prove our long-term way of operating, so it was quite a critical project.

In addition, one very unique peculiarity, I think, of such an operation is that it runs at a fixed clock rate. Normally, of course, you deal with web traffic, which follows your typical web load trend lines.
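Four nines is a stricter budget than it may sound. As a quick illustration (standard availability arithmetic, not figures from the talk), here is the downtime such a target leaves you per year:

```python
# Allowed downtime per year for a given availability target.
# Standard availability arithmetic; not Picnic-specific numbers.

def allowed_downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability) * minutes_per_year

for label, target in [("three nines", 0.999), ("four nines", 0.9999)]:
    print(f"{label}: {allowed_downtime_minutes_per_year(target):.1f} min/year")
# three nines leaves roughly 525 minutes per year; four nines only ~53.
```

At four nines, a single afternoon of cooling failure or a botched maintenance window can consume the whole year's error budget, which is why the two-room setup described later mattered so much.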
Here, instead, we have a conveyor belt that connects everything together and runs at a fixed rate, a little less than a meter per second. Everything that happens here is event-based: we have scanners everywhere, as I mentioned. They detect that a tote is coming along, and maybe there is a junction after the scanner where you need to decide: is it going to go left, straight, or right? That kind of decision point is everywhere in this operation, in many, many places. Any junction point hit by such a tote means an event needs to go from the transport system to the warehouse control system, the warehouse control system needs to do some smart thinking to find the next step, and then send the information back again. All of that needs to happen in a couple of hundred milliseconds, or even less than a hundred milliseconds, before that tote already arrives at the junction point where you want to send it in a different direction. If you miss these kinds of opportunities, totes go in the wrong direction, and those misdiverts, as we call them, are very costly to recover from. In the early phase it was just physical people: they needed to go there, pick up the totes, and put them back on another conveyor belt. Now we have all kinds of software built to do this for us, but it does add a lot of complexity on top of our already constrained processes. And that is quite different from some of the other operations that you would typically cater for.

So we set out to find the solution that we felt was able to meet all of these requirements and all of these types of workloads. We opted for a VMware-based stack that we let external people manage, fully running on premise at this location. This was decided specifically given the team size, which as you saw is not so big. That allowed
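The scan-decide-divert loop described above can be sketched roughly as follows. This is a hypothetical illustration, not Picnic's actual warehouse-control code: the event shape, the deadline value, and the toy routing rule are all invented for the example.

```python
import time
from dataclasses import dataclass

# Hypothetical sketch of the scan -> decide -> divert loop from the talk.
# Event names, deadline, and routing rule are invented for illustration.

DEADLINE_MS = 100  # the decision must reach the junction before the tote does

@dataclass
class ScanEvent:
    tote_id: str
    junction_id: str
    scanned_at: float  # seconds since epoch

def decide_direction(event: ScanEvent) -> str:
    """Stand-in for the warehouse control system's 'smart thinking'."""
    # Toy rule based on the tote id, so the example is deterministic.
    return ("left", "straight", "right")[sum(map(ord, event.tote_id)) % 3]

def handle_scan(event: ScanEvent) -> tuple[str, bool]:
    """Return the chosen direction and whether the deadline was met.

    If the decision arrives too late, the tote passes the junction on its
    default path: a misdivert that people (or more software) must recover.
    """
    direction = decide_direction(event)
    elapsed_ms = (time.time() - event.scanned_at) * 1000
    return direction, elapsed_ms <= DEADLINE_MS

event = ScanEvent(tote_id="tote-42", junction_id="J7", scanned_at=time.time())
direction, in_time = handle_scan(event)
print(direction, in_time)
```

The point of the sketch is the hard deadline: the correctness of the answer is worthless if it arrives after the tote has physically passed the junction.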
us, the Picnic infrastructure team, to focus on the actual catering of the workloads: in this case the Postgres and RabbitMQ clusters, the Picnic workloads, and everything around them, making sure it all works. We opted for running two identical rooms, each fully prepared to run the entire production workload, allowing us to continue operation even if there is a reason to shut down an entire room. You can imagine that if there is a power outage in a room, or you want to do significant maintenance or refurbishment on its internals, we can do this without impacting operations, because we have that capacity on site in another place. We actually had an issue quite early on where the cooling unit completely shut down in one of the rooms, and that allowed us to immediately shift all of the workloads to the other room and continue operation there, which is very handy in such a situation.

The VMware layer actually gives you the opportunity to transparently migrate VM-based workloads from one location to the other. On one side it does this for you if physical hardware fails: it detects the failure and starts the VM up in another location where there is still capacity. But it also offers a solution they call live migration.
You can actually move running VMs that are part of critical operations, live, from one place to the other, without impacting operation either, and that especially is a very nice feature that allows us to do all kinds of maintenance on the existing hardware. And lastly, vSphere comes with a solution they call Fault Tolerance, which I think is a very interesting one. It allows you to run any VM-based workload in replication, effectively, even workloads that were never built to be highly available. In essence, they run the same VM twice and synchronize them fully, on the network stack and on the operations executed on the CPU, and only when something goes wrong in one of the two instances do they divert traffic to the one that is still remaining. Especially for the workloads I described that are not highly available, this was a very important feature for us to build on top of.

And then of course we still wanted to do Kubernetes. At Picnic, and especially in the teams building the software to control all of this, people have essentially grown up in the company using Kubernetes, so we wanted to keep the same offering. In this case we opted to run Tanzu TKGI, the VMware-offered solution for managing Kubernetes clusters: essentially the EKS of VMware. We opted specifically for this because it gives you something they call VMware Validated Design. In essence, VMware tells you: if you use all of these components stacked together, then we guarantee that it functions as it should. That is a nice guarantee, because if something goes wrong, we can rely on VMware support to take it seriously and help us solve those kinds of issues.

Now, that is all good, of course, but while opting for this there are a lot of choices you have to make and decisions you have to consider, and especially for an audience like you,
there are a lot of options that might already have come to mind. One, of course, is that you can run Kubernetes directly on hardware, or at least find a solution that lets you use Kubernetes directly on physical infrastructure. We opted not to do this, mainly because of the workloads we had already identified: we needed to run those VM-based workloads that are not highly available. But we also felt there was too much of a skill gap for us to take responsibility for maintaining these kinds of clusters at the master-node level. We heavily rely on EKS, so we are users of Kubernetes and not operators of the actual clusters, at least as we see it, and therefore we felt it was too much of a risk to go in that direction.

You can of course run virtual machine workloads within Kubernetes via KubeVirt, which is a really cool technology, and I think I would have loved to use it, were it not for such a critical site at that point in time. It supports live migration, which is nice, but that is still behind a feature gate, effectively a beta feature, which is really a bit on the risky side. And of course it does not have that ability to turn a non-HA workload into something highly available, which for us back then especially was quite a miss.

We also considered running Postgres and RabbitMQ in Kubernetes, since we now suddenly needed to run VM-based workloads for them. For Postgres especially we were looking at Zalando's operator, which was quite new at the time, and there were a lot of people who would still actively warn you against using Kubernetes for running these kinds of systems. So we opted to play it safe and go with something where we at least know that if we go into these boxes, we can actually look around: there are no abstractions, no Kubernetes operators or anything else making it harder to debug.
We can go into these boxes, and we had people who knew a lot about Postgres, so that felt a lot safer. The same goes for RabbitMQ: of course there is the operator supplied by the RabbitMQ team, but still, if something in that direction went wrong, we did not feel prepared enough to tackle those kinds of issues in such a capacity.

And lastly, there is AWS Outposts. I think this one, especially if we were to redo this now, would be a very strong contender. Back then it was just not generally available for the Netherlands. We did have good lines with AWS and there were opportunities to try it out, but we would have been early adopters. You would see this, for example, in the support for RDS back then: the multi-AZ, or at least the failover, functionality was not there or not fully ready. And especially back then it was massively more expensive than the solution we opted for with vSphere. Now I think it would be a very interesting candidate to explore.

So with such a solution in place, you would think: okay, we have all of the technology, we have all the tools, we feel these are the right set of technologies, and we hoped that with hardware virtualization, all externally managed, and a platform similar to EKS, we would get the kind of experience we were looking for. Is it as easy to operate? Well, honestly, it is just super complex. We ended up with a completely new tech stack that we had to operate and maintain. We were good in AWS: we have a lot of experience there, we built a lot of tooling, and we effectively had to duplicate all of it. We were hoping to abstract away these different models of operation into modules of Terraform and Ansible, and so on.
It just does not work; the details are too painful to abstract away, which meant we had to do a lot of things twice. For a lot of the technologies we needed to build a custom observability stack, where otherwise you get this out of the box from AWS or other providers. Now we needed our own observability for Postgres and for the proxies that come with Postgres, and so on. Quite a lot of complexity. And then of course we were doing this for the first time, so you run into issues that have already been solved by others who have been doing this better, and that is just a painful operation. The end result is that yes, we were able to pull it off, but with a very high operational load, and the reliability was definitely not what we were hoping for.

This was not something that was only hard on the infrastructure side; running this operation was tricky for everybody. It was the biggest project we had ever pulled off, and just getting it working on the physical level, on the operational side, with the people around it, was super complex. So the development teams were already looking into an alternative way to do this, and they opted for something they call a hybrid FC: effectively less complex in terms of automation, a lot easier to build, faster to build, and better to pull off. For this project we opted for another third-party vendor, which actually said: we are pretty okay with you running everything from the cloud instead of doing it on premise. That was something the first vendor was absolutely against, and it was the reason that drove us to on premise in the first place.

So we built this, we launched it, and it went super smooth. It took a fraction of the time we needed to invest compared to the other location. That led to the logical question: can we not do this for the initial location, the FCA, as well?
To confirm this, there is only one way: you cannot really simulate such a complex operation, you just need to do it live. There is no other way to validate such a situation. So in essence we identified the key points where, in the end state, traffic would go from the cloud to on premise, and started to introduce artificial latency in small iterations, up until the point that the operations team would come screaming at us, complaining that things were breaking down. We would then give them time to recover, identify whether it actually was us (because there are many things going wrong in such a location), and reassess what we wanted to do. We had to do this multiple times, I think five or six: super painful for everybody involved, and the recovery takes a long time and a lot of effort. But ultimately we did conclude: yes, we can. And that was amazing, because now we could suddenly think about going back. We could say: if we want to launch such a location again, we start doing it immediately from the cloud, and if we can get this location to the cloud as well, we can cut our operational workload massively. From a strategic point of view, for conquering the rest of the European market, that is a super powerful proposition.

But then you are in a place where you need to do a lift and shift back from on premise to the cloud. On its own that is not an easy feat. I mean, it is not the largest scope; much larger lift-and-shift operations have been conducted. But we were already launching new locations, and there is a lot of other work my team is involved with; doing a lift and shift on top of that is not so great. AWS, however, has a service they offer called AWS MAP, the Migration Acceleration Program. If you are large enough, and the cloud spend you will end up paying them after the migration is large enough,
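The latency experiments described above worked roughly like the sketch below. This is a hypothetical illustration, not the tooling Picnic actually used: the wrapper, the latency values, and the deadline are invented for the example.

```python
import time
from typing import Callable

# Hypothetical sketch of the experiment from the talk: wrap the on-premise
# call path with artificial latency and measure how often the round trip
# would still make the junction deadline. All values are invented.

DEADLINE_MS = 100

def with_artificial_latency(handler: Callable[[], str],
                            added_ms: float) -> Callable[[], str]:
    """Return a handler that sleeps `added_ms` before delegating."""
    def delayed() -> str:
        time.sleep(added_ms / 1000)
        return handler()
    return delayed

def deadline_hit_rate(handler: Callable[[], str], trials: int) -> float:
    """Fraction of calls that complete within the deadline."""
    hits = 0
    for _ in range(trials):
        start = time.monotonic()
        handler()
        if (time.monotonic() - start) * 1000 <= DEADLINE_MS:
            hits += 1
    return hits / trials

fast = lambda: "straight"  # stand-in for the real decision call
slow = with_artificial_latency(fast, added_ms=150)
print(deadline_hit_rate(fast, 10), deadline_hit_rate(slow, 10))
```

Increasing `added_ms` step by step until the hit rate collapses, and until the operations floor notices, is in spirit what the iterative experiments did, except live and against a real warehouse.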
they will pay you for doing this. They make sure there are resources, which they get from external parties, to actually help you do it. We are now in the process of preparing all of this, where we as a team just validate that the things that are going to be done are sound, and we make sure operationally everything is fine, but we can sleep between one and three; they do the operation, and they do it very well, and that for us is super important. So we are very happy with this.

If you look at what we learned from this, I think, for us specifically and for an audience like this, some of you probably already know: Kubernetes just does not abstract away the actual physical world. If you use the ALB ingress controller, then of course going to a place where it is not there is going to be harder. If you have opted in to all kinds of SaaS-offered solutions that you hook into your Kubernetes cluster, then moving to a different location will just not be the same; a lot of additional complexity comes with this.

And even if you outsource the management of physical and virtual infrastructure, you still have to deal with physical reality. We had the power outage; well, we had a power vendor that was not very reliable, so every so often the diesel generators kicked in, up until the point that the fuel was simply depleted and the whole site came crashing down. You need to learn this in practice, I guess. Maybe there are people who do this better, but it is just stuff you have to deal with, like the cooling equipment failing, and so on.

And then there is working together with these different third-party providers that build such a complex operation for you. If you say that this is mission critical, that this is so essential for your organization, they are going to account for that. They are going to add buffers.
They are going to make sure there is never any risk in there, and I think that also prevented us from finding a solution that might have been much better equipped for, or fitting to, what we as a company are and how we want to operate. So I think it is important that you always collaborate very closely, understand where these requirements are coming from and what is driving them, and decide in what capacity you want to trade responsibilities over certain decisions, instead of just assuming that everybody needs to do what they need to do. I think that is a very powerful thing, especially in such a complex operation.

And then the last one is an interesting one. There is an ongoing discussion started by, I think, the creator of Rails, who is now shifting a lot of their cloud infrastructure to on premise, saying it is going to save them millions and millions. In our case it is actually more cost-effective to run on AWS. I already mentioned the buffering in some of the deployment requirements that our third-party vendor, the one that insisted on physical, was asking for: they could not do virtual CPUs, physical nodes only, very high resource usage. Being able to scale up and down in the cloud allows you to better fit these kinds of resources, instead of buying very heavy hardware that is hardly used. In addition, the discounts you get, the reliability you get, and all of the functionality you get out of AWS is something you would otherwise need to invest in yourself. You can do that, but it requires a different strategy. So in our case it is cheaper, and it also allows us, for example, to play around with the Graviton instances that are available for running in Kubernetes as well. We are experimenting with this and can now adopt it for these kinds of locations too.
It is just something that we really love and want to see developed further, and that is also something we massively enjoy about this.

And that is it. That is my story of how we went to on premise, how we felt it was not such a good fit for us, and how we are now moving back to the cloud. If you are willing to help out, or interested in helping build the best milkman that is serving millions of families, then always feel free to reach out and join Picnic. I hope you learned something today. Thanks.

Question from the audience: what happens if the connectivity fails between AWS and the site?

So, if the internet connectivity fails, right. We spent a lot of time on this and opted for a double line of internet connectivity. These are not just, let's say, T-Mobile and KPN; I believe the term is dark fiber. We have a network team focused on this. These are physically different lines, going in different directions, that connect to different connection points ending up in different parts of the general internet infrastructure, all of this to account for any wire cut or any physical issue with those lines. In addition we use what they call AWS Direct Connect, which reduces the number of hops you take over the actual public internet, relying instead on dedicated lines directly into the AWS infrastructure.

Question from the audience: thank you for the presentation. One question about the PLCs, which are in general a very classical piece of IT. How does this work together with the very modern infrastructure that you are running in AWS?

Yeah, so in our case, of the three levels of control, the transport system that is in place is effectively the bridge between those worlds. They take into account, they support, all of those low-level protocols.
They translate to that physical world, and we purely rely on the events and information that they emit. We told our third-party provider: we are really good at RabbitMQ, so whatever happens that we need to react on, you emit an event on RabbitMQ and we will take it from there. So we really completely isolated out that physical and classical IT infrastructure, so that we can focus on what we think we do best.

Any more questions? No? Thank you very much, guys. Thank you.
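The boundary described in that answer, where the vendor's transport system emits events and Picnic's software reacts, can be sketched as a message contract. This is a hypothetical illustration: the field names and JSON shape are invented, and a real setup would publish and consume these messages over RabbitMQ (e.g. with a client library such as pika) rather than in-process.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical event contract for the vendor boundary described in the talk:
# the transport system emits scan events, warehouse control replies with a
# divert command. Field names and shapes are invented for illustration; in
# practice these messages would travel over RabbitMQ, not in-process.

@dataclass
class ToteScanned:
    tote_id: str
    junction_id: str
    timestamp_ms: int

@dataclass
class DivertCommand:
    tote_id: str
    junction_id: str
    direction: str  # "left" | "straight" | "right"

def handle_message(body: str) -> str:
    """Decode a scan event, decide a direction, encode the reply."""
    event = ToteScanned(**json.loads(body))
    direction = "straight"  # stand-in for the real routing decision
    command = DivertCommand(event.tote_id, event.junction_id, direction)
    return json.dumps(asdict(command))

incoming = json.dumps({"tote_id": "tote-7", "junction_id": "J3", "timestamp_ms": 0})
print(handle_message(incoming))
```

The value of this kind of boundary is exactly what the answer describes: everything PLC-specific stays on the vendor's side of the queue, and the consuming side only ever sees plain, well-defined messages.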