Hello, and thanks for coming to my talk at ServiceMeshCon EU 2021. I'm going to be talking about service mesh, and this is a topic that I've been very interested in for a long time now. I got involved in the Envoy and Istio communities back before Istio was even GA, and I've been working with customers and users of Istio and other service meshes across the world, at large and small organizations, helping them operate and productionize their service mesh and continue to modernize their application infrastructure.

Envoy proxy is the center of this talk today, because a lot of service meshes use Envoy under the covers, and for good reason. Envoy is an open-source project written in C++. It's fast, very feature rich, and was built with a dynamic API to drive its configuration, so no flat files and hot reloading and all this stuff. But Envoy proxy, or any service proxy under the covers in a service mesh, is on the request path and can be kind of complicated.

Envoy can be run as a sidecar, as I've mentioned, in a service mesh pattern where the proxy lives with the application instance. In Kubernetes, this would be with a pod. In a VM world, this would be Envoy living on a VM. Envoy can also live at the edge of a boundary and do reverse-proxy-type load balancing and routing and so forth. And when Envoy is in the request path, developers who are not familiar with Envoy might see it as a black box. This is not specific to Envoy; it applies to any service mesh data plane proxy. So what happens when things start to slow down or don't behave the way you're expecting?
If there are issues, how do you actually troubleshoot them? That is one of the most important, and sometimes the hardest, parts of operating a service mesh: understanding the data plane. So in this talk, we're going to take a look at a few tips and tricks that we have learned, and I have learned, over the years helping folks operationalize Envoy-based technologies like a service mesh.

So the first thing you should know is that Envoy is not a black box. Envoy is a white box. You can see into it, and you can glean a lot of information about what's happening inside the proxy on the request path over time. First of all, there's an admin interface, an HTTP admin interface, that you can query for things like certificate information, which upstream clusters are routable, and what endpoints make up those clusters. You can do things like change the log levels, which will become very important as you're trying to debug things as the requests are going through Envoy, and there are really important features like being able to tap the request and response chains, as well as being able to do profiling and memory dumps. So there's a lot of really important functionality. If you're going to operate Envoy, if you're going to run a service mesh, you should be familiar with getting to this admin interface when you need it, and understanding how to query it and how to get information out of it.

The next thing you should know about operating Envoy is in the certs part of the admin interface, or in the stats part of the admin interface.
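As a rough illustration of querying the admin interface, here is a minimal Python sketch. It assumes the admin listener is reachable at `localhost:15000` (the usual Istio sidecar default; a standalone Envoy listens wherever its bootstrap says). The helper names `admin_url`, `admin_call`, and `parse_stats` are made up for this example and are not part of Envoy.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: admin interface on localhost:15000; adjust for your deployment.
ADMIN_BASE = "http://localhost:15000"


def admin_url(path, **params):
    """Build the URL for an admin endpoint such as certs, clusters, or stats."""
    url = f"{ADMIN_BASE}/{path.lstrip('/')}"
    return f"{url}?{urlencode(params)}" if params else url


def admin_call(path, method="GET", **params):
    """Send a request to the admin interface and return the body as text.

    Endpoints that change state, such as logging, expect method="POST".
    """
    request = Request(admin_url(path, **params), method=method)
    with urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_stats(text):
    """Turn the plain-text stats output ("name: value" lines) into a dict.

    Lines whose value is not a plain integer (histograms, for example)
    are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats
```

With a proxy running, `admin_call("certs")` and `admin_call("clusters")` would return certificate details and the upstream clusters with their endpoints, `admin_call("logging", method="POST", connection="debug")` would raise the level of the connection logger, and `parse_stats(admin_call("stats"))` gives a dictionary in which a gauge such as `server.days_until_first_cert_expiring` can be compared against an alert threshold.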
There are indicators that will give you a heads-up when things might start to go wrong. One of those things, which we've found the hard way, is keeping an eye on when certificates are about to expire. Now, a service mesh like Istio might have a way of doing automated rotation of certificates, but sometimes that doesn't happen properly. Checking these certificate stats, whether you have automation in place or not, to determine whether you're getting closer and closer to a cert expiring without being rotated when you expect, is extremely important.

Another is what happens when the Envoy proxy starts to come under severe load. What is the behavior that you expect? We don't want the Envoy proxies to just lock up and behave in a state that we can't understand or that is unpredictable. So we can set things like the overload manager in Envoy, and observe it and watch it. What is the memory or CPU pressure that Envoy is at right now, and what should it do when it gets to certain thresholds? Should it stop doing keep-alive on certain connections if you reach a certain threshold? Should it stop accepting requests? Should it start shedding requests? So Envoy has this feature called the overload manager that allows you to specify ahead of time what happens when the proxy starts to come under pressure, so you can start to reason about and understand what's happening.
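A minimal sketch of an overload manager configuration follows, written as a Python dictionary and rendered to JSON (Envoy accepts JSON as well as YAML for its bootstrap). It uses the heap thresholds from the talk's example: 95 percent to disable keep-alive and 99 percent to stop accepting requests. The 2 GiB heap budget and the refresh interval are assumptions for illustration, not values from the talk.

```python
import json

GIB = 1024 ** 3
HEAP_MONITOR = "envoy.resource_monitors.fixed_heap"

overload_manager = {
    "refresh_interval": "0.25s",  # assumed polling interval
    "resource_monitors": [
        {
            "name": HEAP_MONITOR,
            "typed_config": {
                "@type": "type.googleapis.com/envoy.extensions."
                         "resource_monitors.fixed_heap.v3.FixedHeapConfig",
                # Assumption: size this to the memory limit of the pod or VM.
                "max_heap_size_bytes": 2 * GIB,
            },
        }
    ],
    "actions": [
        {
            # At 95% of the heap budget, stop HTTP keep-alive so that idle
            # long-lived connections get closed and cleaned up.
            "name": "envoy.overload_actions.disable_http_keepalive",
            "triggers": [{"name": HEAP_MONITOR, "threshold": {"value": 0.95}}],
        },
        {
            # At 99%, stop accepting new requests.
            "name": "envoy.overload_actions.stop_accepting_requests",
            "triggers": [{"name": HEAP_MONITOR, "threshold": {"value": 0.99}}],
        },
    ],
}

# Render the fragment so it can be merged into a bootstrap file.
rendered = json.dumps({"overload_manager": overload_manager}, indent=2)
print(rendered)
```

Once a configuration like this is loaded, the stats part of the admin interface exposes the overload manager's pressure readings and active actions, so the thresholds can be watched rather than guessed at.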
Here's a snippet, an example of what it might look like. You specify that we're going to monitor the heap space, and if it gets to 95 percent, or crosses that high watermark, then we're going to disable keep-alive. What that means is, if there are long-lived connections open and there's not really much happening on them, but we have keep-alive there so that it keeps those connections open, we're going to stop the keep-alive, and those connections should eventually get cleaned up. If we hit 99 percent, or cross that high watermark, then we will stop accepting new requests, and so on. You can check the docs, but this is a really powerful feature when running Envoy.

Another one is being able to log out to standard out, or to a log file, when requests are coming into the system, and log metadata about each of the requests: things like certificates, things like X-Forwarded-For headers, things like request and response details of the message and headers, and so on. All of that is extremely useful for debugging and troubleshooting when things start to go wrong in an Envoy-based environment.

Another incredibly important piece of this puzzle is not just the access logs, which we just talked about (a request comes in, and we can log metadata about that), but also the logging of the proxy itself. Envoy has extremely detailed logging levels, or rather different modules inside of Envoy that you can enable logging levels for, and some of the common ones are around connection handling, what the filter chains are doing,
how routing happens, how RBAC policies are applied, and so forth. Those can all be enabled at very fine-grained levels. If things are not connecting because of TLS issues for some reason, go check the Envoy logs. That will give you a very good hint, if not exactly why things are not working.

And then, one of the last slides is about tuning for cloud deployments, especially in a public cloud. One of the most common issues that we've been running into, and that I've been bitten by for the last several years really, is how the load balancers behave in a public cloud. If you end up running Envoy proxy on a VM, and you end up running a data plane or a control plane somewhere else, and you're connecting those things with an Amazon load balancer, you could see very unpredictable results. So being able to tune things like keep-alive, and session and stream keep-alive, is very important. Here's a little snippet for how you might do that for an upstream cluster. Maybe you're connecting to the control plane for configuration updates, or you're connecting to an ext-authz service, part of the control plane, to determine authentication or authorization policies when a request comes in. Configuring these things is extremely important.

Like I said, these are all tips and tricks that hopefully help you. We've learned them the hard way, and we have the scars to show for it. If you're interested in running Envoy-based technology, either at the edge as an API gateway or as a service mesh like Istio, and you're looking for a way to make that successful and simplify doing that, please reach out to us at solo.io. That's exactly what we work on. That's what we specialize in, it's where our core competency is, and we'd be happy to help. So with that, thank you, and go enjoy the rest of the talks at ServiceMeshCon.
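The slide with the upstream keep-alive snippet is not captured in this transcript. As a rough sketch of the kind of tuning described for an upstream cluster, the Python below builds a cluster definition with TCP keep-alive and HTTP/2 connection keep-alive and renders it to JSON. The cluster name, hostname, port, and every timing value are assumptions for illustration; the intent is only that keep-alive probes fire more often than the load balancer's idle timeout, which should be checked against the cloud provider's documentation.

```python
import json

HTTP_OPTIONS = "envoy.extensions.upstreams.http.v3.HttpProtocolOptions"

cluster = {
    "name": "control_plane",  # hypothetical cluster name
    "type": "STRICT_DNS",
    "connect_timeout": "5s",
    # TCP-level keep-alive on upstream connections (values in seconds).
    "upstream_connection_options": {
        "tcp_keepalive": {
            "keepalive_time": 60,      # idle time before the first probe
            "keepalive_interval": 10,  # gap between probes
            "keepalive_probes": 3,     # unanswered probes before closing
        }
    },
    # HTTP/2 keep-alive pings, useful for long-lived gRPC streams.
    "typed_extension_protocol_options": {
        HTTP_OPTIONS: {
            "@type": f"type.googleapis.com/{HTTP_OPTIONS}",
            "explicit_http_config": {
                "http2_protocol_options": {
                    "connection_keepalive": {"interval": "30s", "timeout": "5s"}
                }
            },
        }
    },
    "load_assignment": {
        "cluster_name": "control_plane",
        "endpoints": [{
            "lb_endpoints": [{
                "endpoint": {"address": {"socket_address": {
                    "address": "control-plane.example.internal",  # hypothetical host
                    "port_value": 443,
                }}}
            }]
        }],
    },
}

rendered = json.dumps(cluster, indent=2)
print(rendered)
```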