Hi, my name is Garrett Griffin, and I am a senior engineer and technical lead for infrastructure and platform for digital fulfillment.

Hi, my name is Justin Turner. I'm the engineering leader responsible for digital fulfillment at H.E.B. Today we're going to be talking about H.E.B's curbside fulfillment systems and how, on our modernization journey, we accelerated things by adopting Linkerd. It's going to be a technical, fact-based talk, in the hopes that our adoption of Linkerd and the technical aspects of it benefit you if you're thinking about adopting a service mesh. Additionally, we'll talk about how this factored into our pandemic response.

A little bit about H.E.B: it's a Texas-based retailer that was established in 1905, has several hundred locations across Texas and Mexico, and is deeply ingrained in the spirit of Texas. You can talk to almost any Texan and they're going to tell you how much they love their H.E.B.

One of the services that we offer is curbside and home delivery. This allows our customers to shop online and submit their grocery order; we do the shopping on their behalf in store and then hand it off to them at the curb, or take it to their door. We do this through a set of fulfillment tools called FAST. This is a mobile and web application that allows our in-store partners to shop, substitute, manage orders, navigate through the store, and ultimately hand the order off to the customer.

We started early on with a monolith. When curbside was a proof of concept, this is how it began life, and we started to run into challenges as we scaled up the business: struggling to deliver quickly, lots of risk associated with change, and ultimately reliability issues. That put us on our modernization journey to break into microservices. If you want to learn more about the modernization journey itself or our reliability efforts, you can check out the talks that are at the bottom of the slide.

In early February we were very close to the finish line of our modernization efforts. We had our services mostly complete, with enough parity to start rolling out to stores, and we were aiming for May to start doing that. This is when things changed. They changed for me, they changed for you, for our customers, and for the world: COVID hit. This also changed the nature of the service that curbside provides. We went from being a nice-to-have convenience to a critical resource for slowing the spread of COVID in Texas, and a lifeline for immunocompromised customers who want to reduce the risk associated with grocery shopping.

This changed our priorities. We went from primarily being focused on finishing up services to reinforcing and enhancing the monolith to handle the oncoming load that we were anticipating, as well as enhancing it with features and functionality like best-available products to handle inventory shortages, SNAP support so that every customer who relies on it can use curbside, and a variety of other features that helped us respond to COVID. The side effect of this was that it pushed out our timeline. It was the right thing to do for Texans, but ultimately we still believed that the monolith was not the system that curbside needed anymore. Before more features were added to the growing list of needs, we knew we needed to pull the timeline closer so that we could build the best system for Texans and really focus in on what mattered. We did this through a couple of measures.
We reorganized ourselves to get through the work faster. We really leaned into our chaos engineering approach to make sure that what we were building was resilient. And then we adopted Linkerd as our service mesh. We pulled it forward on our roadmap: we had intended to get to it after the rollout, but we went ahead and prioritized it. The hypothesis was that this would help us solve some of the challenges we were running into as we built our microservices and the operational muscles around supporting them, helping us with early observability challenges and, ultimately, some of the retries and networking concerns that we were starting to have to wrap our heads around to make sure this was a reliable system.

When we started our cloud journey, we knew there would be challenges ahead of us. We knew we were going to split our monolith into microservices, and with that comes a lot of added complexity. To start, we now had another layer of complexity within our own ecosystem, having to worry about state and security at a whole new level. On top of that, our CI/CD system had begun to show its age. While it was a great solution for a while, it had become a little bit cumbersome to deploy the service manually to over 80 servers, and schema management had become more difficult, as we needed another team to help us apply those changes, which of course required coordination.

But now that we were getting a fresh start, we could start evaluating these things. What could we automate to simplify our CI/CD system? Would we be able to take a more iterative approach to schema changes, and even better, could we apply those changes in our deployment pipeline? Since we're attempting to be completely stateless, would there be anything to help us with retries and timeouts between services? Were there any tools that could help us visualize all of this?
With research, we knew that a service mesh was in our future, but we had to figure out the timing. Planning and going down the what-if rabbit hole can only get you so far, so we set out to just do it. We defined our domains and decided on a service to split out of the monolith. It didn't take us long to figure out just how much was changing: we had completely overhauled our CI/CD system, our branching strategy, and how we worked through and thought about backwards compatibility with contracts and schema changes, all the while learning new tools like Docker and Kubernetes. We even began to learn new things about tools we'd used for a long time, like how the JVM behaves once it's containerized. It was fun just learning how hungry Java can be with memory. We thought this cloud thing was supposed to be easy and solve all of our problems, but we realized pretty quickly that we could probably use a little bit of extra help.

So here's the disclaimer, friends: a service mesh is not for everyone. While you get a lot of things out of the box, you now have an extra layer of complexity. You have to ask yourself if you're really going to use the dashboards, whether you care about the metric scrapers, or if you just really want mTLS. If you only need one of those things, you're hurting yourself by going all in; there are plenty of tools out there if you want to build your own mesh. After our research and reviewing our needs, a mesh just made sense, so we set out to integrate one and see what would happen.

We saw some pretty heavy deal breakers with some of the meshes, though. At the time, one wouldn't let us create our own internal load balancer, as it wanted to handle that for us. Others wouldn't let us create our own ingress solution, as it was built into one of the services they provided. There was also a case of a minor version update breaking our services, causing a considerable amount of downtime and time to fix.

Then we came to Linkerd. It was lightweight, it plugged into the existing ecosystem, it didn't force any ingress solution on us, and we could still use third-party tooling. These were major wins for us already, so we decided to plug it in and see how well it played in the sandbox. Out of the gate, we were pretty impressed with the documentation. Our proof of concept went almost textbook by the docs: with the CLI, we ran our checks to make sure it would integrate, ran an install command, and it just started working. We didn't see side effects for any of our unmeshed services. So we integrated: we injected the sidecar proxy into an existing service and began to get almost instant feedback. It worked a little too well, to the point of being suspicious, but we liked what we were seeing, so we pressed forward. With the POC going so smoothly, we decided to open this up and mesh the rest of the services for a particular namespace. Out of the box, we started seeing the golden metrics and the service maps, which started to get us excited.
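To make that flow concrete, here is a minimal sketch of the kind of proof-of-concept steps described above using the Linkerd CLI. The deployment and namespace names are hypothetical, and the exact commands depend on your Linkerd version (newer releases move the metrics commands under the viz extension).

```bash
# Minimal sketch of a Linkerd proof of concept (names are placeholders).

# Verify the cluster is ready for Linkerd, then install the control plane.
linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check

# Inject the sidecar proxy into one existing deployment and redeploy it.
kubectl get deploy my-service -n curbside -o yaml \
  | linkerd inject - \
  | kubectl apply -f -

# Watch the golden metrics (success rate, RPS, latency) start flowing.
# On Linkerd 2.10+ this lives under the viz extension: `linkerd viz stat`.
linkerd stat deploy -n curbside
```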
During this time we discovered the hidden gems of Linkerd. We met with Buoyant, and the amount of help they gave us was staggering. They've been with us pretty much every step of the way since we decided we needed a service mesh, and it's been great having them along during this journey. The Slack community has also been awesome to interact with. Keeping an eye on the channels has helped us stay ahead of issues when upgrading and spot other potential pitfalls people have come across in their everyday lives with the mesh. If you decide to go with Linkerd, I strongly suggest having an engineer or two join the community and interact with it. It won't be something you regret.

Now that we were hooked, we needed to productionize the system. We needed to enable high availability mode for production, and we wanted to keep everything in code, as is the standard. Our clusters were already written in Terraform, so by this point we weren't strangers to the concept. Thankfully, there's already a Helm chart written for Linkerd, but there was a slight hiccup we ran into during an upgrade: we found out we couldn't just upgrade the Linkerd image when moving from one version to the next. We had to upgrade the Helm chart version as well, which can bring contract changes. That aside, upgrading has been pretty painless.

By the way, though: read the documentation closely, as problems can arise between the chair and keyboard. A line was missed while reading, and we may have marked our kube-system namespace for proxy injection. That really isn't an issue until you do something crazy like bringing up a new cluster, which will struggle to become healthy because it's waiting on the Linkerd proxy injector to become healthy, which is waiting on the critical systems to become healthy, which are waiting on Linkerd, and so on and so forth. It became a fun lesson in debugging, but it could have been avoided.

Now that we were productionized, we were feeling pretty hungry for more features, but of course growing pains are bound to happen. While dipping our toes into the world of canaries, we found that sometimes our schemas were not as backwards compatible as we thought. As in, they weren't, and things were breaking. So accommodations and new mindsets had to be made, and upgrades to our charts would follow, but more on that later.

Before we could get to that point, we had to get our service profiles up and running. In Linkerd, a service profile is a resource that gives you fine-grained, per-route information. Basically, you define the routes using the power of regex, and Linkerd takes it from there, giving you reports on a per-route basis. Thanks to this, you can get those sweet, sweet timeouts and retries. The problem is that you have to generate the profile, comb through 500 lines of YAML, add in what you want for retries or timeouts, and then deploy it. We wanted to take it a step further while still leveraging what we had and keeping the service engineers' involvement minimal. The setup should be there out of the box when they spin up a new service, and they should only have to change values in one place to leverage the features from Linkerd.
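As an illustration, here is a minimal sketch of a ServiceProfile that adds a per-route retry and timeout. The service name, namespace, and route are hypothetical; the one real constraint is that the profile's name must be the fully qualified DNS name of the Kubernetes service it describes.

```bash
# Minimal ServiceProfile sketch with a retry budget, a retryable route,
# and a per-route timeout. "orders" and "curbside" are placeholder names.
kubectl apply -f - <<'EOF'
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: orders.curbside.svc.cluster.local
  namespace: curbside
spec:
  routes:
  - name: "GET /api/v1/orders/{id}"
    condition:
      method: GET
      pathRegex: "/api/v1/orders/[^/]*"
    isRetryable: true      # safe to retry because this route is idempotent
    timeout: "300ms"       # fail fast instead of hanging on a slow upstream
  retryBudget:
    retryRatio: 0.2        # retries may add at most 20% extra load
    minRetriesPerSecond: 10
    ttl: "10s"
EOF
```

The Linkerd CLI can also bootstrap a profile from an OpenAPI/Swagger spec with `linkerd profile --open-api`, which is in the same spirit as the Swagger-based automation described next.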
The other problem we started to face was that we had this scary new concept out there that, until now, the engineers didn't really have to mess with. It just existed, and they got a lot of cool stuff out of the box. So whenever something would break, now that they'd integrated with these new things, we'd start to get messages like, "Is Linkerd down?" We would go check, and it would turn out that there was an issue pulling an image, or there was a config change, so the proxy sidecar would never get healthy, which would trigger alerts; people would come to us instead of the services team. So we needed to find a way to build knowledge and confidence in our engineers for using Linkerd.

Thankfully, one of the first things we did at the beginning of the journey was create a template project that all the other services were created from. Using an open-source tool, it became plug-and-play to get a new service up and running with all of our tooling already built in. So now we ensured consistency across all of the services; when new tools had to be added, it was a pretty painless process, and there was no guessing where everything was.

One of the shared libraries we were already using was Springfox, which gives us Swagger docs. We found an open-source tool that would take our Swagger spec and output the service profile. We made a few changes to the client, baked it into our Docker buildpack, and made it an extension to be leveraged by the services. So now, in one place, our engineers could get retries and timeouts. Couple this with our custom chart, and meshing a service became a two-line affair. When they were ready, they would flip the switch in their charts, and the service would be meshed in the next deployment.

Leveraging Flagger, another third-party tool, we were also able to add canary deployments. As long as the service was meshed, they would now get golden metrics as well as fine-grained details for their service to be leveraged by the canary. And having learned our lessons, we added a way for the engineers to skip the analysis and deploy the service in the manner that existed before canaries. This is for cases where there is a breaking change going out or some kind of incompatible schema change.

Now it was time to build confidence throughout the teams. When we found the check command in the Linkerd CLI, we knew it would be pretty easy to automate and alert based on the output. With 14 lines of bash and a cron job, we were able to create alerts based on the health of Linkerd. The next step will be to self-heal based on that output, most likely for things like certificate expiration.
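As a rough illustration, here is a minimal sketch of that kind of cron-driven check, assuming a Slack-style incoming webhook for alerting; the webhook URL and file paths are placeholders, not our actual setup.

```bash
#!/usr/bin/env bash
# Minimal sketch of cron-driven alerting on Linkerd health, in the same
# spirit as the "14 lines of bash" mentioned above (placeholders throughout).
set -uo pipefail

WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

# `linkerd check` exits non-zero when a control plane or data plane check fails.
if ! linkerd check > /tmp/linkerd-check.log 2>&1; then
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"linkerd check failed on $(hostname) - see /tmp/linkerd-check.log\"}" \
    "$WEBHOOK_URL"
fi
```

Scheduled from cron, for example every five minutes: `*/5 * * * * /opt/scripts/linkerd-check-alert.sh` (the path is hypothetical).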
A big part of our development process in our group is introducing chaos to the services, breaking things in new and exciting ways in non-prod. There's a certain kind of glee captured in an engineer's heart when they're told they're allowed to break something deliberately. So with Linkerd, we came up with our hypotheses and our testing criteria, and we went to work. We were able to test and verify that if we scale the pods to hundreds of instances, the proxy will still be injected. While doing this, we ended up discovering a connection issue completely unrelated to Linkerd; we were able to remediate it and keep going. We were also curious what would happen while we were scaling service pods: would the traffic be distributed correctly if the Linkerd destination pods were in a less-than-desired replica state? Through our tooling, we were able to watch the traffic being directed without any scaling interference.

Our big fears were around the proxy injector and the control plane, though. We were able to bring down the injector and still have our services function. They would not be able to scale or deploy new versions, but they kept working. This brought us relief, in that we knew we would have time to fix the issue if this were to occur in production. Finally, when we brought Linkerd completely down, we were able to verify similar results as with the proxy injector being down: there would still be issues with scaling and deployments, but the services would still be able to communicate. This was the level of peace of mind we were looking for. This is what was going to let us sleep at night, and it's hard to put a value on that.

The wins we get from Linkerd have been pretty awesome. Canaries are now in place for most of our services, giving us another layer of protection when deploying. We've created the gating needed to begin deploying during business hours, which in the past was unheard of. What was great about this integration is that it was with a third party, and it was just another instance of things just working with Linkerd. Some of the immediate benefits we saw were during an active incident, where we were able to verify immediately in our dashboards that our services were receiving traffic and correctly sending out requests, but never getting anything back. Because of this, our innocence was proven, and we were able to work with the teams necessary to resolve the issue at hand. The benefits are made even better by the fact that it has not disrupted our workflow. We continue to document as we did before, add a couple of extra lines, and the necessary items are set up. Same-day fixes have become easier and faster because we've got these protections in place now. We're also able to reduce the number of big-bang deployments, which were always error-prone. To my team, this journey has definitely been worth it.

The benefit of all of the efforts that Garrett just spoke to is that we were able to accelerate the completion of our microservices. We went to our first store in early July, and since then have been rolling out to the remainder of the company. This has allowed us to do our part to make sure stores were there for our customers in their time of need. We've been building our operational competency with our new microservices while taking key throughput off of the monolith that's still in place, which has helped with our reliability and resiliency. We've been learning how to best observe, support, and deliver safely and quickly to our new services, thanks to the new tools that are available to us with our Linkerd service mesh.

Thank you for attending today. If you have any questions or want to discuss further, feel free to reach out to Garrett or myself directly. If you want to learn more about what we're doing at H.E.B, check out digital.heb.com for a blog covering several of our technical efforts. Again, thank you, and we appreciate you.