From San Mateo, it's theCUBE, covering Scalar Innovation Day, brought to you by Scalar. Hello everyone, welcome to the special Innovation Day with theCUBE here in San Mateo, California, heart of Silicon Valley. I'm John Furrier with theCUBE. Our next guest is Steve Newman, the co-founder of Scalar. Congratulations, thanks for having us. You guys have a great company here. Thanks, glad to have you guys here. So tell the story, what's the backstory? You guys have an interesting pedigree of founders, all tech entrepreneurs, tech savvy, tech athletes as we say. Tell the backstory, how did it all start and how did it all come together? So I guess I'd trace the story back to when I was part of the team that built the original Google Docs, and a lot of the early people here at Scalar either were part of the Google Docs team or were people we met while we were at Google. And really Scalar is an outgrowth of, it's a solution to, problems we were having trying to run that system at Google. Google Docs of course became part of a whole ecosystem with Google Drive and Google Sheets, and all these applications working together is a very complicated system, and keeping that humming behind the scenes became a very complicated problem. Well congratulations, Google Docs is used by a lot of people, so it's been a great success. Scalar's different though, you guys are taking a different approach than the competition. What's unique about it? Can you share the history of where it came from and where it's going? Yeah, so maybe it'd be helpful just to set the context a little bit. To the blackboard. Yeah, so let me put a little flesh on what I was saying about this very complicated system that we were trying to run in the whole Google Drive ecosystem. There are all these trends in the industry nowadays: the move to the cloud, and microservices, Kubernetes, serverless, continuous deployment.
These are all great innovations. People are building more complex applications and they're evolving faster, but it's making things a lot more complicated. To make that concrete, imagine that you're running an e-commerce site back in the dot-com, web 1.0 era. So you're gonna have a web server, maybe Apache. You've got a MySQL database behind that with your inventory and your shopping carts. Maybe an email gateway and some kind of payment gateway. And that's about it, that's your system. Each one of these pieces involved going to Fry's, buying a computer, driving it over to the data center, slotting it into a rack. You know, a lot of sweat went into every one of those boxes, but there's only about four boxes. That's your whole system. And if you wanted to go faster, you threw more hardware at it, more RAM. Exactly, and not literally threw, but literally carried. You literally brought in more hardware. And so it took a lot of work just to run that simple system. Fast forward a couple of decades: if you're running an e-commerce site today, well, you're certainly not seeing the inside of a data center. Stripe will run the payments for you. Somebody like Amazon will run the database server for you. This is much, much simpler; one person can get this going in an afternoon, literally. But nobody's running just this today. This is not a competitive operation today. If you're in e-commerce today, you also have personalization and advertising based on search history or purchase history. And there's a separate flow for gifts, and then there's printing, and interfacing to your delivery service. And now you've got 150 blocks on this diagram. And maybe your engineering team doesn't have to be so much larger, because each one of those blocks is so much easier to run, but it's still a complicated system.
And trying to actually understand what's working, what's not working, why it isn't working, and tracking that down and fixing it, that's the challenge today. And that's where we come in. And that's the main focus for today: you can figure it out, but the complexity of the moving parts is the problem. Exactly. So you see, oh, 10% of the time that somebody comes in to open their shopping cart, it fails. Well, the problem pops out here, but the root cause turns out to be a problem with your database system back here, and figuring that out, that's the challenge. Okay, well let's grab a seat and continue. So with cloud technology, the economics have changed. How is cloud changing the game? So it's interesting, it changes the game for our customers, and it changes the game for us. For our customers, we touched on this a little bit: things are a lot easier, people will run stuff for you, you're not running your own hardware, often you're not even running your own software, you're just consuming a service. It's a lot easier to scale up and down, so you can do much more ambitious things and you can move a lot faster, but you have these complexity problems. For us, it presents an economy-of-scale opportunity. So, you know, we step in to help you on the telemetry side. What's happening in my system? Why is it happening? When did it start happening? What's causing it to happen? That all takes a lot of data: log data, other kinds of data. Every one of those components is generating data, and by the way, for our customers, now that they're running 150 services instead of four, they're generating a lot more data.
And so traditionally, if you're trying to manage that yourself, running your own log management cluster or whatever solution, it's a real challenge: as you scale up, as your system gets more complex, you've got so much data to manage. We've taken an approach where we're able to serve all of our customers out of a single centralized cluster, meaning we get an economy of scale. Each one of our customers gets to work with, basically, a log management engine that's scaled to our scale rather than to the individual customer's scale. So the older versions of log management had the same kind of complexity challenges you just drew out for e-commerce. As the data types increase, so does the complexity. Is that right? So the complexity increases, but you also get into just a data-scale problem. Suddenly you're generating terabytes of data, but you only want to devote a certain budget to the computing resources that are going to process that data. Because we can share our processing across all of our customers, we fundamentally change the economics. It's a little bit like when you go and run a search on Google: in that tenth of a second that Google is processing the query, literally thousands of servers, maybe 3,000 on the Google side, may have been involved. Those aren't your 3,000 servers. You're sharing them with 50 million other people in your data center region. But for a millisecond there, those 3,000 servers are all for you. And that's a big part of how Google is able to give such amazing results so quickly, but still economically for them. And that's basically, on a smaller scale, what we're doing: taking the same hardware and making all of it available to all of the customers. People talk about metrics as the solution to scaling problems. Is that correct? So, this is a really interesting question. So, you know, metrics are great.
If you look up the definition of a metric, it's basically just a measurement, a number. And it's a great way to boil things down. So, I've had 83 million people visit my website today, and they did 163 million things in this area, and that succeeded, and so on. You can't make sense of all that raw detail. You can boil it down to: this is the amount of traffic on the site, this was the error rate, this was the average response time. So these are great, it's a great summarization to give you an overall flavor of what's going on. The challenge with metrics is that they tend to measure your problems, your symptoms. Site's up, it's down, it's fast, it's slow. But when you want to get to the cause of that problem, all right, exactly why is the site down? I know something's wrong with the database, but what's the error message? What's the exact detail here? A metric isn't gonna give that to you. And in particular, when people talk about metrics, they tend to have in mind a specific approach where this flood of events and data is distilled down very early: let's count the number of requests, measure the average time, and then throw away the data and keep the metric. That's efficient, because throwing away data means you don't have to pay to manage the data, and it gives you this summary. But then as soon as you want to drill down, you don't have any more data. So if you want to look at a different metric, one that you didn't set up in advance, you can't do it. And if you need to go into the details, you can't do it. There's an interesting story about that.
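The trade-off described above can be sketched in a few lines of Python. This is a made-up illustration, not any vendor's pipeline; the event fields and numbers are invented for the example. It shows how distilling events into a metric early keeps the summary but destroys the ability to ask a new question later:

```python
# Minimal sketch of a "distill early, discard events" metrics pipeline.
# All names and numbers here are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Event:
    service: str
    latency_ms: float
    ok: bool

events = [
    Event("checkout", 120.0, True),
    Event("checkout", 95.0, True),
    Event("checkout", 2400.0, False),  # the interesting failure
    Event("search", 40.0, True),
]

# Distill the flood of events into a few numbers...
request_count = len(events)
error_rate = sum(not e.ok for e in events) / request_count
avg_latency = sum(e.latency_ms for e in events) / request_count

# ...then throw away the data and keep the metric.
del events

print(request_count, error_rate, avg_latency)
# The summary survives, but "which service failed?" or "what was the
# error message?" is now unanswerable: that dimension was never rolled
# up, and the raw events are gone.
```

The summary is cheap to store, which is exactly the appeal; the cost is that any question you didn't set up in advance can no longer be answered.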
You know, when you were at Google, you mentioned a lot of the problem statements came from Google, but one of the things I love about Google is they really nailed the SRE model. They clearly decoupled roles: developers, and site reliability engineers who have essentially a one-to-many relationship with all that massive hardware. And that's a nice operating model, it had a lot of efficiencies, it was tied together. But you guys are kind of saying that, in a way, as developers use the cloud, they become their own SREs, because the cloud can give them that kind of Google-like scale in smaller ways. Not Google size, but a similar dynamic, where there's a lot of compute and a lot of things happening on behalf of the application, or the engineer, the developer. As developers become the operators through their role, what challenges do they have, and how do you see that happening? Because that's an interesting trend: as applications become larger, cloud can service them at scale, and developers then become their own SREs. How does that roll out? How do you see that? Yeah, so this is something we see happening at more and more of our customers. And one of the implications is that you have all these developers who are now responsible for operations. But they're not that specialist SRE team. They're specialists in developing code, not in operations. They minor in operations. And they don't think of it as their real job; it's a distraction. Something goes wrong, all right, they're called upon to help fix it. They wanna get that done as quickly as possible so they can get back to their real job. So they're not gonna make the same mental investment in becoming an expert at operations, or at the operations tools and the telemetry tools. They're not gonna be a log management expert, a metrics expert.
And so they need tools that have a gentle learning curve and are gonna make it easy for them to get in, not really knowing what they're doing on this side of things, find an answer, solve the problem, and get back out. And that's kind of the concept you guys have of speed to truth. Exactly. And we mean a couple of things by that. Most literally, our tool is a high-performance solution. You hand us your terabytes of log data, you ask some question, what's the trend on this error in this service over the last day, and we give you a quick answer. Big data: scan through it, give you a quick answer. But really, that's just part of the overall chain of events that goes from the developer with a problem to the developer with a solution. They have to figure out even how to approach the problem, what question to ask us. They have to pose the query in our interface. And so we've done a lot of work to simplify that learning curve, where instead of a complicated query language, you can click a button, get a graph, and then start breaking that down visually. Okay, here's the error rate, but how does that break down by server, or user, or whatever dimension? And be able to drill down and explore in a very straightforward way. How would you describe the culture at Scalar? I mean, you guys have been around for a while, and you're still a fast-growing startup. You haven't even done a B round yet, you've got an A round. You guys self-funded it, got customers early, they pushed you, and now you've got 300-plus customers. What's the culture like here? So, you know, this has been a fun company to build, in part because the heart of this company is the engineering team, and our customers are engineers. So we're kind of the same group, and that keeps the inside and the outside very close together.
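The "break it down by whatever dimension" idea described above can be sketched in plain Python. This is not Scalar's actual implementation or query language, just a hypothetical illustration of the underlying idea: because the raw log events are retained, the grouping dimension can be chosen after the fact rather than baked in up front.

```python
# Hypothetical drill-down over raw log events; field names are invented.
from collections import Counter

logs = [
    {"server": "web-1", "user": "alice", "ok": True},
    {"server": "web-1", "user": "bob",   "ok": False},
    {"server": "web-2", "user": "alice", "ok": True},
    {"server": "web-2", "user": "carol", "ok": True},
    {"server": "web-1", "user": "bob",   "ok": False},
]

def error_rate_by(dimension, events):
    """Group raw events by an arbitrary field and compute error rate per group."""
    total, errors = Counter(), Counter()
    for e in events:
        key = e[dimension]
        total[key] += 1
        if not e["ok"]:
            errors[key] += 1
    return {k: errors[k] / total[k] for k in total}

# The same raw data answers questions along any axis, chosen after the fact.
by_server = error_rate_by("server", logs)
by_user = error_rate_by("user", logs)
```

Contrast this with the pre-aggregated approach: there, "error rate by user" only exists if someone thought to configure that roll-up before the incident happened.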
And I think that's been part of the culture we've built: we all know why we're building this and what it's for. We use Scalar extensively internally, but even if we weren't, it's the kind of thing we've used in the past and we're gonna use in the future. And so I think people are really excited here, because we understand the why, and we have an opinion about the future and how it should roll out. What's the big problem statement that you guys are solving as a company? How would you boil that down if asked by a customer or an engineer out there? What real problem are you solving, the core problem, the big problem, that's gonna be helping me? You know, at the end of the day, it's giving people the confidence to keep building these kinds of complicated systems and move quickly. Because this is the business pressure everyone is under: whatever business you're in, it has a digital element. And your competitors are doing the same thing, they're building these sophisticated systems, they're adding functionality, and they're moving quickly. You need to be able to do the same thing, but it's easy then to get tangled up in this complexity. So at the end of the day, we're giving people the ability to understand those systems. And the functionality and the software is getting stronger and stronger, more complicated, with service meshes and microservices, as applications start to have the ability to stand up and tear down services on the fly. That's only gonna yield more data. Exactly, you get more data and it gets more complicated. Actually, if you don't mind, there's a little story I'd like to tell. So hold on just while I clear this out. So this is gonna go back to Google and, again, kind of part of the inspiration of how we came to build Scalar. And so this is gonna be a story of frustration, of a problem we got ourselves into. Frustration and motivation. Yeah.
So we were working on this project, building a file system that could tie together Google Docs, Google Sheets, Google Drive, Google Photos. And the block diagram looked kind of like the thing I just erased. But there was one particular problem we had that took us months, literally months and months and months, to track down. You'd like to solve a problem in a few minutes or a few hours, but this one took months. And it had to do with the indexing system. So you have all these files in Google Drive, and you wanna be able to search. And so we had modeled out how we were gonna build this search engine. You'd think, you know, Google Search is a solved problem. But actually, Google Web Search is for things the whole world can see. There's also Gmail Search, which is for things that only one person can see, so it's lots of separate little indexes. Those were both solved problems at Google. Google Drive is for things a few people can see. You share it with your coworker or whoever. And it's actually a very different problem. But we looked at the statistics and we found that the average document or file was shared with about 1.1 people. In other words, things were mostly private, or maybe you'd share with one or two people. So we said, if something's shared to three people, we're just gonna make three copies of it. And now we have just the Gmail problem: each copy is for one person. And we did the math on how much work it was gonna be to build these indexes. In round numbers, we were looking at something like this. At the time (this would be so much larger now) we had maybe one billion documents and files in the system. Each one was shared with about 1.1 people. Maybe it was a thousand words long on average, and maybe it would be edited once per day on average. So we had about a trillion word updates per day, if you multiply all that together.
And so we allocated, we put in a request and purchased machines to handle that much traffic. And we started bringing up the system, and it immediately collapsed. It was completely overloaded. And we checked our numbers and we checked them again. Yeah, 1.1, about a billion, whatever. But the work coming into the system was just way beyond that. And we looked at our metrics, measuring the number of documents, measuring each of these things, and all the metrics looked right. To make a months-long story short, these metrics and averages were hiding some funny business. It turned out there was this type of use case where you'd have occasional documents that were shared to thousands of people. And there was a specific example: the signup sheet for the Google company picnic. This was a spreadsheet, shared to about 5,000 people. So it wasn't the whole company, but a big chunk of Mountain View. Which meant it was, I don't know, let's say 20,000 words long, because it had the name and a couple of things for each person. This is one document, but shared to 5,000 people. And during the period people were signing up, maybe it was changing a couple thousand times per day. So you multiply out just this document and you get 200 billion word updates for that one document in a day, where we were estimating a trillion for the whole world. And there was something like 100 documents in this category. So Google was hamstringing your own thing. We were hamstringing our own thing. There were about 100 examples like this. So now we're up to 20 trillion, and that was the whole problem, these 100 files. And we would never have found that until we got way down into the details of the logs, which we didn't have the tools to do. And this took months. Because we didn't have the tools, because we didn't have Scalar. Yeah, yeah, yeah.
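The back-of-the-envelope arithmetic in the story above works out as follows, using the rough figures quoted (a sketch of the estimate, not Google's actual capacity model):

```python
# The capacity model based on averages, as described in the story.
documents = 1_000_000_000     # ~1 billion documents and files
avg_copies = 1.1              # shared with ~1.1 people -> ~1.1 index copies each
avg_words = 1_000             # ~1,000 words per document on average
avg_edits_per_day = 1         # edited ~once per day on average

planned_load = documents * avg_copies * avg_words * avg_edits_per_day
# ~1.1 trillion word updates per day, for the whole system

# The outlier the averages were hiding: the company-picnic signup sheet.
picnic_words = 20_000         # a name and a couple of fields for ~5,000 people
picnic_copies = 5_000         # one index copy per person it was shared with
picnic_edits_per_day = 2_000  # "changing a couple thousand times per day"

picnic_load = picnic_words * picnic_copies * picnic_edits_per_day
# ~200 billion word updates per day, from one document

outliers = 100                # ~100 documents behaved like this
outlier_load = outliers * picnic_load
# ~20 trillion per day: roughly 18x the entire planned capacity,
# contributed by 100 files the averages couldn't see

print(planned_load, picnic_load, outlier_load)
```

That is the whole point of the anecdote: every metric (document count, average sharing, average size, average edit rate) was correct, yet the product of a few extreme values in one tail dwarfed the planned load.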
And I think this is the kind of anomaly you might see with web services evolving with microservices, where someone has an API interface with some other SaaS, as apps start to rely on each other. This is a new dynamic we're seeing, as SLAs are also tied together. So the question is, whose fault is it? Exactly, you have the whose-fault-is-it question. And also things get so much more varied now. Again, web 1.0 e-commerce: you buy a thing, you buy a thing, it's all the same. Now you're building a social media site or whatever. You've got eight followers, you've got eight million followers. This person has three movies rented on Netflix, this person has 3,000 movies. Everything's different. And so then you get these funny things hiding. Yeah, you're flying blind if you don't get all the data exposed. It's like a blind person trying to read Braille, as we heard earlier. Steve, thanks so much for sharing the insight. Great story. I'm John Furrier, here with theCUBE for Innovation Day at Scalar headquarters. Thanks for watching.