Hello everyone, and thanks for tuning in to this episode of New Age Engineering. My name is Adam Furmanek, and today I have two fantastic guests with me. First is Roy Kriger, CEO of Metis, and with us is also Shimon Tolts, CNCF Ambassador. Hello, folks.

Hey, it's great to be here. Thanks for having me.

Lovely. So today's topic is DevOps challenges when scaling your production or company, and that is what we are going to talk about for the next 40 minutes or so. But before jumping straight in, how about we introduce our guests? Roy, maybe a couple of words about yourself.

Hi, nice to meet you. My name is Roy. I'm the CEO of Metis, building developer tools around databases. I'm based in Israel, in a city called Petah Tikva, which is basically the New Jersey of Tel Aviv. Married, with a small kid.

Cool. Glad you're here with us. Shimon, how about you?

Yes, hi everyone, and thanks for having me. I'm a CNCF Ambassador. My background is in DevOps and infrastructure. I was a general manager leading the infrastructure division of a media company. I also have a YouTube channel called ShimonOps, where I make educational videos about Kubernetes, OpenTelemetry, and cloud native workloads in general. I fell in love with open source and programming when I was 13 years old, when I installed my first Linux, and ever since I've been geeking out on software development and how to scale companies.

That sounds cool. By the way, I highly recommend Shimon's channel, so let's do a bit of promotion for Shimon as well. I'll put a link to the channel, and to the other things we mention in this episode, in the notes attached to this video.

Okay folks, so the topic for today is scaling companies, scaling products, and the challenges that come with that. When I think about it, I actually don't even know what the difference is between scaling the company and scaling the product. Isn't it the same? How does it change? Shimon, do you want to start?
You have worked at one of the highest-growth companies in Israel, so maybe you can share some experience and insight around that.

Yes. I worked at a company that has since merged with Unity, the great company that makes tools for game development. When I joined we were just 30 people, and when I left we were a thousand. Part of my job was to help scale the organization and the infrastructure, and I think those are two different things. On the one hand, you need to scale people and processes within the organization: who reports to whom, how we work, how we create tickets, and how we make sure that we work on the right priorities. On the other hand, you have scaling engineering: we have more customers onboarding and more traffic coming from the same customers, so how do we make sure that our production is scalable enough to handle all of those changes? At the end of the day, I think the two collide, because you need your system to be elastic, but you also need to be able to work on it easily. A monolith with 1,000 engineers might be a little bit challenging to work on, for example.
Yeah, I totally agree with Shimon. From my experience, one cannot live without the other: you cannot scale your production environment without changing the processes. You start with everybody pushing code, just trying to get to market as soon as possible. Then suddenly you realize that you have product-market fit, or are starting to have it, and you feel all the air being sucked out and a huge pull on your product, and everything collapses. So you're trying to hire as many developers, engineers, and support people as you can, but at the same time you need to change the way you build your production environment to accommodate that growth. You need the right people and the right processes, and also something that I think is not talked about enough: the right tooling. How do you choose your tooling, and what gates do you put in place? Maybe this is part of the processes Shimon mentioned, but I would argue it is also what enables your production environment to grow. Otherwise it's going to break all the time.

This reminds me of something. When you mention tooling and scaling the product: in computer science there is a thing called Conway's law, which says that the software architecture resembles the organization's structure. I think it was Netflix that said they started doing microservices not because of the product, but because they couldn't manage the people anymore. What do you think about that? You mentioned tooling, but how does tooling apply here? Should we also change our tools, move to microservices, and all of that when we scale our product? Roy, what's your call on this?
I totally agree. It used to be that you had companies with maybe 20,000 engineers, and I'm talking old-school enterprises, but their efficiency was so low that those 20,000 engineers barely had the ability to contribute to the business. As for what you mentioned about Netflix, I think it's the only way: you need the ability to scale out, and the way you scale out is by creating siloed or semi-siloed services or domains in which certain groups of engineers are independent. You give them the right tooling, the ability to move fast, and self-service. Our entire industry is now all about self-service, and for a reason. You cannot move fast in today's very competitive environment without letting teams, whether customer success or engineering, decide things by themselves. This is something we're trying to do within Metis as well: let teams be decision-makers and have independence. But you have to keep some kind of governance around that. I'm sure Shimon can tap into that, since he managed a huge organization, and I'd love to hear him talk about it.

From my experience, showing both extremes helps us understand where a good middle ground could be. On the one side, you can have, let's say, a thousand engineers working on one monolith. Then there are a lot of dependencies on one another, because everyone is stepping on everyone's toes, and it's really difficult to synchronize everything. On the other side of the spectrum, everyone does whatever they want and builds whatever they need, but then you have a lot of duplication. You don't have a platform engineering team that says: okay, I have ten microservices that need, I don't know, authentication, authorization, and observability, and instead of ten teams building
it ten times, we'll have one team that builds it, and everyone will use it. Somewhere in between there is an equilibrium that works. An interesting point is Jeff Bezos: when he talked about AWS, he said, I'd rather have five of the same thing than zero. I'm very close with AWS, and when you look at its services, each service has a GM, a product lead, and an engineering lead, and in general they get autonomy and can really move fast. Of course they need to adhere to security and compliance and all of that, but it is not unheard of that you have service A, and on its edge another service that does the same thing, and they almost compete with each other. I think we need to strive to get to this equilibrium.

But this is actually the same case we have in software engineering, right? There is this Don't Repeat Yourself (DRY) rule. How do I know when to stop repeating myself, or how do I know whether I can build something independently, which would be a duplication according to what you're saying?

I think it comes back to synchronization. Say we have five product teams and one infra or platform team that is going to be responsible. The tough thing is: how do I make sure that I don't repeat myself five times before I understand that something has to be built once, across the board? And on the other hand, how do I avoid premature optimization, building some crazy thing that will only ever be used once? I guess that harmony and syncing between the different teams, if we look at Conway's law, and talking within the organization itself, are very important. When I was GM of an infrastructure division, every week I would sit with the VPs of R&D of all the teams and ask each of them: what are you working on?
And by understanding what each one of them was working on, I could say: hey, you and all of the rest need an event collection system, so maybe I should build it for you instead of each of you building it yourselves.

From my experience, and I'll tap into the last sentence you just said, at a certain scale it's inevitable that you're going to have duplication if you really want to move fast. Your optimization target is running fast, and you pay for it with inefficiency. But in my experience it boils down to transparency and honesty. If a company really has transparency, it won't eliminate everything, but people will understand, to some extent, what other divisions are doing, and you reduce the duplication and inefficiency in the company. So it is a structural issue, but it's more of a cultural issue to be able to reduce the amount of duplication.

But let's talk tech. I think there are some things you can do that will help you make the right decision. Number one is to say that each service is API compliant. You decide whatever you want internally, but you could say that every service has an HTTP endpoint or reads from a queue, and this is how you communicate with it. That allows other services to use it as an infrastructural service, and every piece of code that is written is productized and can be consumed by another service. That is very different from writing something closed, where if you want a change you need to request it from the other team and somehow connect with it. If we make things API first, it will be much easier to integrate.

Okay, so that's one thing: API first. What other things can we put in place, then? I actually don't know what we are targeting. We've already accepted that we do have some duplication, and this
is inevitable, as we mentioned, but we would probably like to limit it and keep it under control, right? So what other mechanisms could we put in place to keep it under control?

Yeah, so make it easy for people to consume other services. I also think it comes back to the non-technical side: how do you establish points of contact in the company for who is responsible for what? Going back to the technical side, today there are IDP products, which are developer portals, that give you a service catalog and discovery, so you can understand what you even have. A story that happened to me: at one company where I worked, we wanted to turn a web app into a desktop application. At the time there were two main technologies you could do it with: there was Electron and there was another one. For some reason we chose the other one and started working on it. Then, in the elevator, we heard someone say Electron, and we were like: did you say Electron? Yeah. Do you work with Electron? Yeah, of course, for two years now. What? How did we end up choosing the competing technology? I just didn't know, and it was only by chance, in the elevator, that I found out we were building on two competing technologies. So having a catalog might help us, I think.

Yeah, I agree that a catalog might help, but I'll repeat myself again: at the end of the day, just as Shimon presented it, it's a cultural thing, a transparency thing. If you know that the company uses a certain technology, you'll probably go to that team and ask them about it, and you'll probably opt for the same technology rather than adopt a different one. But at the end of the day, Shimon, from my experience that's what you pay when you shift left and give autonomy to developers. You don't have a perfect world, right? So you pay in
one thing and you gain value in the other, and usually those two collide. You have to choose, at each stage of the company, what you prefer and how far to limit autonomy versus speed and velocity.

I want to give another example. I'm one of the creators of the open source project Datree, which prevents misconfigurations in Kubernetes. When I talk to companies that use Kubernetes at scale, there are always two sides, and one side is the DevOps team and the security team. I ask them: are you community police or are you a sheriff? What does that question mean? In one organization, developers can do whatever they want, and DevOps and security act as community police. They tell them: this is the good thing to do, and we hope that you do this and that. Then there's the other way around, the sheriffs, where everything is locked: this is the only thing you're allowed to do, and if you want something else you need to open a ticket and get access from us. Of course both approaches bring trouble, and each has its own limitations and capabilities. On the one hand, when everything is open, it's the Wild West, and it's very hard for the security and DevOps teams because they don't know what's going on. On the other hand, with the stricter approach, it's very hard for the R&D teams because they're crippled, and they also bombard the DevOps team with tickets for changes or modifications. And guess what: doing tickets to change this memory limit or that configuration is not the best use of the DevOps team's time. I guess guardrails are something in the middle: we have rules, and as long as everyone works within the rules, it might help. So that's a topic we might talk about.

Yeah, what comes to my mind is that I once worked at a company that was kind of
on the sheriff side, as you described. The problem was that there were so many teams working on products independently. When you went to start a new product, you obviously had some templates or bootstrap projects that would make this faster. So you click, you start from a template, and bang, you immediately get 20 different tickets: because you use old libraries, because you did something wrong, because you don't have the right policies in your AWS cloud permissions, and whatnot. So you have those two worlds: in theory you can go fast, but then some sheriff comes and slaps your hands just because you did that.

You tell me, his name is CISO.

Someone on behalf of the CISO, I guess, yeah. But on the other hand, I can't imagine how security could be done in this area. We know what DevOps is, and we now have this trend of DevSecOps, where security is merged into DevOps, but I don't see it being prevalent. I can't imagine how to do security properly: how to give freedom to developers and engineers and at the same time make sure we do not leak data, break GDPR, or other things. Do you see any solutions to that?

Yeah. The solutions that I see start with number one: monitoring and discovery. You need to actually understand where you are and what your status is. We joked about the CISO slapping your hand, but unfortunately, in many cases no one actually knows what's happening. Maybe your database is about to explode tomorrow and you don't know it. One company that I work with now is running MySQL 5.7, and it turns out that in three months it's going to be deprecated. Oops. Everything's on fire, and it's: oh my god, Shimon, you've got to help us, we've got to upgrade to MySQL 8.0. And so
number one is understanding where you are. Number two is putting some sort of guardrails and maybe enforcement layers in place. When I go to spin up a MySQL RDS instance, it should maybe tell me: Shimon, you should not spin up a new 5.7, you should spin up version 8, because otherwise in three months we're going to have to clean up the mess that you're making now. So let's avoid this mess in the future. The last part is remediation: how do I fix the issues that I have now? Of course it's the most expensive and the hardest, because you already have systems running on it and you have to put a lot of resources into that. But what are you seeing, Roy? It's funny that I have the database example; it just happened, I heard about it yesterday.

For us, as a company developing what I would call the modern database stack, I think it all comes together, the things we spoke about earlier in the conversation. Tapping into what you just said, the way we see the world, even leaving our customers aside, is that, just like you said, you have to give developers autonomy and the ability to move fast. But as your company grows, you get to a state where you have a reputation in the market, and each production failure really hurts your customers and your business. There comes a time when you can't really know everything as a leader, whether in DevOps, DBAs in our domain, or engineering leadership. You have to put trust in the developers and have governance, and try to have both. The way to do it is to put in the right guardrails. For example, we can talk about how you do CI/CD the right way. I'd like to hear your thoughts on how much a DevOps leader needs to invest in the right gating: CI/CD processes, end-to-end testing, and whatnot. We're trying to help developers and companies do it the right way in terms of the
database: how you don't break schemas, performance, or the configuration of the database, how you avoid outages, and so on. At the same time, how do you enable monitoring, so you can see that everything works as expected and is ready to scale? And when things do drift, how do you detect it as early as possible, whether it's a Kubernetes cluster or nodes that are not on par with your SLOs or SLAs, or your database? Are you going to wait until you have a disaster, or are you trying to be proactive and ahead of the curve? How do you put the right observability in place in 2023? That's the way we see it: you want to give developers the right tooling, tooling that integrates into their modern, cloud native workflows, rather than forcing them to use tools that are not aligned with their processes and making them do things they don't want to do. I think that's the biggest challenge for companies moving forward and moving fast, especially around databases.

Yeah, the thing is that developers are already overloaded, don't you think? We already put so much on them: DevOps, now security, MLOps in machine learning applications, not to mention other things like reporting to managers and maintaining or monitoring business metrics. So how can we help them? When I used to be a developer, one of the things that always bothered me was that I needed to implement the thing, test it, and deploy it, and then later report to all the people who kept asking me: hey, where are you, what are you working on right now? I couldn't find a way to make that easier and faster. So I just can't imagine how you can put more guardrails on developers.

It's tough, but I don't know if database maintenance should belong to developers or to DevOps teams. It's a
question, it's really a question. But let's go back to what the correct processes are. First of all, I'd say that in the production outages that have happened to me in my entire career, it was always the database. It's either a bad migration, or a Redis that started working from disk instead of from memory, and everything just slowed down. It's always the fucking database, pardon my French. I think today it has become more popular to work with frameworks like dbt and to have schemas that are configured programmatically. I can give you the same example from the CI/CD world of how not to do it. How not to build a good CI/CD process: install a Jenkins, configure everything manually, don't version any change, and just go and click on buttons. Then the DevOps person leaves, someone else has to take care of it, and they're like: I don't know what the hell is going on. The good way to do it is to choose a platform that has infrastructure as code for managing my CI, I don't know, GitHub Actions, CircleCI, whatever you like to use. Everything is versioned, everything is in your source control, and every change is reviewed, because it's a critical path of your infrastructure. Now, I don't know why, but in many cases I see companies go create table, alter table, alter column, just doing whatever they want on their database. It's the Wild West, and then you wonder why the migration broke everything. So go back to infrastructure as code and, in my opinion, codify all of the things that you do with the database. But I'd really like to hear what you see, Roy, because you're the expert in databases.

Yeah, expert is a very strong word. But about what you said on maintaining the database: we don't look at it that way in today's world. When you say maintaining a Kubernetes cluster, it's not about maintaining, it's
about treating it the right way, the way you treat your kids. You want to make sure they are being fed, quote unquote, the right food. So you want the ability to push the right code, the right configuration, the right changes, making sure they are aligned with the way you see the company moving forward and with the company's needs. It's about making sure your production is healthy and addressing issues that need to be addressed just in time. Now, going back to the database: we do think that in today's world it's inevitable. At the end of the day, if your production breaks and your database has an outage, the person who will have to figure out what happened, and who is going to be under stress, is the developer. Usually it won't be about high availability, not with managed services. It will be about queries exploding, indexes, temporary files, and DevOps personnel usually are not responsible for that. So the way we see it, in a modern company each service can have its own database, or you can have a big database with several schemas, and each team needs to make sure it doesn't break its specific schema when pushing code. Each team monitors it and detects drift, like Datadog observability for your application, and basically has the responsibility to fix things. They need the right tooling to do that. We think the market is already positioned today for developers to really have end-to-end responsibility, including the database.

So one question that I have for you, Roy: what are the top issues? Give me specific examples of the top issues that are happening.

So, for example, one obscure issue that keeps happening
to our customers is that, for some reason, the database statistics are not being updated, or haven't been updated in a while. Maybe it was because of configuration, maybe it was a bad deployment. It's not that something has changed; the statistics just weren't updated. What happens is that the optimizer plans with the statistics it knows about, so a query can suddenly take ten minutes, or, I'm exaggerating, a minute, because the statistics are not up to date with what the database currently holds. Then you get locking on your data, and your customers can't access their data, which is horrible. That's one thing. Second, we have developers just using an ORM and hitting the N+1 problem, day after day after day. We keep seeing it: dev teams use an ORM and say, you know what, I don't want to know anything about the database. It works when you are on day zero, but it will never work once you have your 10, 15, 20 customers. I can't imagine the company you worked at, with a thousand engineers, working blindly with an ORM. Horrible decision.

It is, because when you put it in, it works perfectly, and then as you add more and more customers it accumulates and accumulates until it just breaks. And you didn't change anything. It's crazy.

Yeah, and not only that: you see SQL that is slow, but you never wrote any SQL. You wrote JavaScript, you used an entity. What do you mean? I never wrote this thing, I know nothing about it, how do I fix it? And the third thing is developers not being mindful of how they write queries. With a relational database, you need to think not only about what data you're returning but about how you get it. We have companies finding queries that scan five tables and millions of rows and return two rows, which is crazy, on a query that runs thousands of times a day. So, just off the top of my head, those are three things that keep coming back, and it bites you at the worst time: on
Christmas Eve, on your Friday nights. But not only that: think about configuration. And I don't want to go on too long about what we're seeing around schema migrations: losing your data and needing to revert to a database snapshot, which is crazy for a company. You can't be very sympathetic to the notion of reverting to a previous snapshot of your database.

Yeah, talking about migrations: recently I met a guy from, I don't want to give names, but it was one of the biggest companies, one whose website you probably visit when you go on vacation and want to find a place to stay for the night. He told me that they had a migration that ran for over a month, which wasn't expected. And yeah, if that breaks in the middle, you are in big trouble.

I wish there will be a day when we have a migration running for a month.

Cool. So it seems like we covered quite a lot: we mentioned various challenges, guardrails, and database issues. Any wrap-up thoughts you'd like to share with the audience? Anything that comes to your mind at the end?

Yeah, I think that the number one issue of companies is premature optimization. It's totally okay to build things in a non-scalable way. Once you hit product-market fit and get the customers, no worries, there are a lot of companies that can help you with scaling. At that point you need to start looking at what you built and start peeling the onion, layer by layer, turning the pieces into separate services and treating each as a specific unit. By separating them, you get ownership and stability in the different areas. And don't do it too early, because a company that does it too early won't get to the growth phase, I guess.

I totally agree with what Shimon just said, and I don't need to add to it. I'm just going to say that once you hit the
scale, and I mean when you really hit scale, not just a few customers, my advice is that it's okay to stop for a second and prepare for the next step. Really prepare, in terms of everything we talked about, observability and CI/CD. Don't be afraid to call it the top priority at the right time. Just like Shimon said, there is a right time for everything. But more often than not, I see companies afraid to prioritize observability, CI/CD, and the other things that prepare the company to scale safely. Don't be afraid to make it a priority at the right time, exactly like Shimon said. That's my advice.

And you said companies are afraid. Why are they afraid?

Because of pressure. They want to compete in the market, and they need to have features and more features. More often than not, they think that if they don't release more features, the market will perceive them as not innovative, or their competitors will be faster than them. But sometimes the slowest path is the fastest path, because otherwise you are going to break and hurt your reputation. We can see what happened in terms of security, or, you know what, talking about it, what happened to Atlassian with their database: 30% of their customers weren't able to access their data and the product because something broke in their database. You don't want to get there. So sometimes it's okay to say: I'm going to be less fruitful in terms of features, but my main feature is my production health and processes, and that will help me scale to the next level.

Cool, sounds good, and I couldn't agree more. Okay, I think it's time to call it a day, then. Let's move on to our picks of this show. So Roy, where do you live now?

I live in a city called Petah Tikva, and Shimon, as an Israeli citizen, is joking that it is
basically the Florida of Israel: when you're old, you go to Petah Tikva. But I'm not that old yet.

And yet you're still retired.

I'm not retired yet. I just got married.

Lovely. So, any good food places in Petah Tikva?

100% none, honestly, and I'm not joking. None. You know what, maybe a place called Pitmaster, that's a good one. I think maybe that's the only one I would consider a good place. But if you ask me where I would go to eat, and if I want to promote something, it would be a place called Oro, which is gold in Spanish. It's a wine bar in a neighborhood near where Shimon lives. It's a small place, and if we are all about promoting very cool places that are not trying to be chains and that try to serve local communities, I would promote them. Go check them out, I really like them.

Okay, cool. Shimon, now moving on to you. I heard you know some good places in San Francisco.

Yeah, so I wanted to give a broader recommendation. There's a great steak place called EPIC Steak, in San Francisco on the Embarcadero. It's an awesome place. My co-founder and I spent a lot of time in San Francisco, and we love this place. We love it so much that my co-founder, who has an Israeli name that no American can pronounce, decided to add another name, and the name he decided to add is Epic. So basically Eyar is Epic.

Epic?

Yeah, Epic Zilberman.

I thought it was because of the gaming company, right? Epic?

No, no. We were really drunk at the steakhouse, and it was: you should definitely call yourself Epic, it's an epic name. And it's a good place. Long story short, if you go to the Y Combinator listing in Bookface, his name is actually Eyar Epic.

The entire podcast was worth it only because of this story, Shimon, honestly.

Yeah, that is awesome, and now we know some hidden gems about some people. That is lovely.

Hey Adam, what is there in Kraków? You have
to share as well.

There are many good places around here. One I can recommend is called Kompania Kuflowa Pod Wawelem. Wawel is a castle in Kraków, and this place is kept in an old communist-era vibe. You can go there and get something like a three-pound pork chop, so you can eat quite a lot. But the best thing is when you go to the restroom and hear some really dry dad jokes coming from the speaker. It's lovely, and it really makes the time flow a little faster, if you catch my drift.

We've got to go to Kraków.

Definitely, 100%.

Okay folks, thank you for being on this show, and to everyone, thank you for tuning in to this episode of New Age Engineering. All the links and materials we mentioned will be added to the notes. Hopefully you enjoyed it and will tune in for the next episode. Have a great day.

Thank you, bye bye.

Thank you.