Hello, welcome to TFIR. This is your host Swapnil Bhartiya, and today we are going to talk to Jacques Nadeau, CTO and co-founder of Dremio, a company that specializes in open-source data analytics software. The most interesting thing about this company is that almost 90% of their software is fully open source, and they are heavily involved with three major Apache Software Foundation projects. In fact, the product from Dremio is based on these three projects. So without further ado, let's go and talk to Jacques. Jacques, before we deep dive into this interview, can you quickly introduce yourself? Yes, of course. So my name is Jacques Nadeau. I am CTO and co-founder at Dremio. I'm also active in several open-source projects: I was part of the initial team that started the Apache Arrow project, I'm involved in Apache Calcite, and I contribute to Apache Parquet sometimes. And just to give a little background on what Dremio is: Dremio is an open-source self-service data platform designed to give data consumers an easier time accessing data, and also to reduce the load on data engineers by making some of the tasks they do today easier, so that end users can do those things directly. When you say self-service data platform, can you explain what exactly that is? Yeah, so very simply, we try to make data available to end users without making them worry so much about the mechanics, the sort of physical and mechanical side of data. And what I mean by that is that data may be in many systems. If I'm an analyst, I'm not that interested in what particular type of technology the data is in. I just want to interact with that data, do data science activities against it, do machine learning, do analysis against it.
And so what we try to do is create what we sometimes call a Google Docs for your data: a place where you can find all the different data sets that are available to you, where you can collaborate with other people around those data sets, and where you can connect those up to whatever other tools you'd like to use for analysis. Oh, that's interesting, because the collaboration part is kind of interesting. Since it's like Google Docs, does it have a web interface where people can log in and do that? That's exactly right. So the primary interface for the product is a web UI, which shows all the data that might be available to you, indexes it, and allows you to manipulate that data, curate it, and create what we call virtual data sets, which are derived versions of data. It basically allows business users to work in a logical world where they're not that concerned about where data might be stored or what the underlying system might be, whether it's a relational system, a big data system, or a NoSQL system. Okay. And so, as you said, it really doesn't matter, but where do you run it yourself, on AWS? What kind of cloud infrastructure do you use to host the data? Yeah, so today, Dremio is a piece of software that is used both on-premise and in the cloud. So probably about half of our users and customers are on-premise, running in their data centers, and half are in the cloud. And it runs as a distributed system in your environment. For example, let's say you're running Elasticsearch, MongoDB, and Oracle. You would run Dremio on a set of nodes, which could be maybe secondary read nodes for some of your other systems, or it could be a separate cluster. And then Dremio does the work to make that data available to users. It could also be in something like a Hadoop environment. In a data lake environment like Hadoop, we can interact directly with something like YARN to deploy, and use that as a sort of containerization methodology.
So if you have a containerized, distributed deployment strategy, then we can use that. But if not, you can just put this on bare metal. Okay, so it's totally multi-cloud; users can choose where they want to run it. Yeah, exactly. Our perspective is that there are so many complex technologies out there that everybody's environment is quite different, and so we try to be as open to different ways of working as possible. Right. And who are your typical customers consuming this service and technology? Well, it's a range. So we do sell a lot to large enterprises. We have several public customers in the Fortune 500, as well as many more that we're working on making public, so that people can hear about them. So we do a lot of work there. But because we have an open-source product — a community edition that people can download and play with, or fork on GitHub, or whatever — we also see a lot of smaller businesses and startups who adopt the technology. The reality is that the complexities and the governance requirements of large enterprises are obviously different from those of smaller companies, but all of them are suffering from the same problem, which is that it's too hard to get to the data. People are having difficulty finding the number of data engineers that they need, given how they're working with data today. And so they need some technology to help them out. Right, right. Now let's quickly switch gears and talk about the open-source angle here. How much of your product is open source? So we're about 95% open source. We have an open-core model. The software is on GitHub and is Apache licensed. And we believe that that's critical for organizations of all sizes, in order for these data technologies to get heavily integrated into other parts of a company's technology ecosystem.
And so being able to understand how the code is working, extend the code, and enhance it for the particular departments that you have is something that's very important to us. And so that's the Dremio product. Now beneath that, we have a strategy where we're focused on really collaborating and building communities around some of the core components that we build our technology on. We can't do what we do without working with a number of really important open-source technologies. The three that we are most actively engaged with, leverage, and also enjoy the benefits of are Apache Arrow, Apache Calcite, and Apache Parquet. And so I don't know how familiar you and your viewers are with each of those technologies; if you want, I can give a quick overview of each of them. Yes. So let's start with Apache Parquet. So Apache Parquet — I think Calcite and Parquet are about the same age — is what I would call the de facto standard columnar on-disk format for data. Okay. It was originally inspired by the Dremel paper from Google and has been adopted by pretty much every big data technology there is. It orients the data on disk in a columnar format, and that benefits analytical operations, because you might have an analytical table that has hundreds of columns, but each time you're looking for data, you may only be looking for a few of those columns. And so by orienting the data on disk that way, you can get to the data that you want faster, without reading as much data. And so that's sort of the core. Any time we're writing stuff to disk, we're generally writing it in the Parquet format, and we are optimized to read Parquet very, very quickly. Now, the second project that I mentioned was Apache Calcite. There's been development on the code that is now Apache Calcite for almost 15 years now, I think. And at its core, it is the pieces of a database.
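The column-pruning benefit of a columnar on-disk layout that Jacques describes for Parquet can be sketched in a toy form. This is not the real Parquet format (which adds row groups, encodings, compression, and metadata); it is just pure-Python byte packing to show why scanning one column touches far fewer bytes than scanning interleaved rows:

```python
import struct

# A tiny 3-column table: (int32 id, float64 amount, int64 count).
rows = [(1, 10.0, 100), (2, 20.0, 200), (3, 30.0, 300)]

# Row-wise layout: the columns of each record are interleaved on disk.
row_wise = b"".join(struct.pack("<idq", a, b, c) for a, b, c in rows)

# Columnar layout: each column's values are stored contiguously.
col_a = struct.pack("<3i", *(r[0] for r in rows))
col_b = struct.pack("<3d", *(r[1] for r in rows))
col_c = struct.pack("<3q", *(r[2] for r in rows))
columnar = col_a + col_b + col_c

# To scan only column "a", a columnar reader touches just its contiguous
# slice; a row-wise reader must step through every full record.
a_values = struct.unpack("<3i", columnar[: len(col_a)])
print(a_values)                       # (1, 2, 3)
print(len(col_a), "bytes read vs", len(row_wise), "for a row-wise scan")
```

With hundreds of columns instead of three, the gap between the two byte counts is what makes analytical scans over Parquet so much cheaper.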
It's not trying to be a database itself, but it has several different pieces. One of the core pieces is a Volcano-inspired query optimizer, which is a cost-based, highly powerful optimizer framework with a bunch of pre-built optimizations to figure out what can be made faster. And it's used by, I think, probably 15 or 20 different big data technologies now as a way to understand SQL and optimize queries. So we use that extensively to do a lot of work. And then the third project is the newest of the three. It's called Apache Arrow. Apache Arrow is something that we were actually part of the team that started, about a year and a half ago now, maybe two years ago — a little while ago now. What we saw was a need: if you look at most organizations, they have a large number of different data technologies that can help them. But what happens is that when you're trying to work on a particular data use case, you kind of get stuck in one system. And the reason you get stuck in one system is because it's very expensive to move data between different systems. Okay, so that was one dynamic we identified. And that's why monolithic systems work well: you know that you can move from one operation in a monolithic system to another without paying a lot of overhead. Whereas if I want to, say, do a pipeline which starts in Kafka, then moves to Spark, then moves to Python, then moves back to Spark and into a BI tool, there's a lot of overhead moving the data between each of those different systems. So that was the first trend that we identified as a problem. The second one was that while everything is columnar on disk, once people brought data into memory, they were generally working on it row-wise. And traditional databases always used to do row-wise operations.
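A cost-based optimizer like the Volcano-inspired framework in Calcite is far too large to show here, but the core idea — enumerate equivalent plans, estimate a cost for each, keep the cheapest — can be sketched in a few lines. The table sizes and the flat "each join keeps 0.01% of the cross product" selectivity below are made-up numbers purely for illustration:

```python
from itertools import permutations

# Hypothetical row counts for three tables (made-up numbers).
TABLES = {"orders": 1_000_000, "customers": 50_000, "regions": 100}
SELECTIVITY = 1e-4  # toy assumption: each join keeps 0.01% of the cross product

def plan_cost(order):
    # Crude cost model for a left-deep join pipeline: the cost is the
    # total number of intermediate rows the plan materializes.
    rows = TABLES[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * TABLES[table] * SELECTIVITY
        cost += rows
    return cost

# Enumerate every join order and keep the cheapest one, which is what a
# cost-based optimizer does (real ones prune this search space heavily
# and also apply rewrite rules, pushdowns, and so on).
best = min(permutations(TABLES), key=plan_cost)
print(best, plan_cost(best))
```

Here the optimizer "discovers" that joining the small tables first keeps the intermediate results tiny, exactly the kind of decision Calcite makes when it reorders joins in a SQL query.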
It was kind of the first way to do things, the easy way to think about things. But it's not the most optimal. CPUs are very efficient at doing similar operations repeatedly, at having data lined up right so that you can pay attention to cache locality, and at operating on multiple values simultaneously with some of their instructions, those kinds of things. And row-wise processing doesn't work very well with the way that modern CPUs work. And that's not just true for CPUs, but also GPUs. GPUs are also very good at vectorized operations, at holding data in memory in a useful format to process very quickly. And so we saw those two different pain points, and we said there needs to be a new project, a new way of trying to solve this problem. And so Arrow was born out of that. Before the project even got started, there were maybe three or four of us. Wes McKinney was involved very early on. Marcel Kornacker was involved very early on. Todd Lipcon from Cloudera, who works on Kudu, was involved very early on. We started having discussions about how we could collaborate as a group of different technologies to try to solve these sort of fundamental problems. And so we had several discussions, and ultimately decided, hey, you know what, there's a good opportunity here to start a new Apache project. And so we brought people from all sorts of different big data and open-source projects together and started up this new project. Its goals are really to solve those two things I just talked about: one is, how do we have a canonical way of holding data in memory so that it's very efficient to move between one context and another? And the second is, how do we make sure that that representation is very high speed in terms of how fast you can process it and how well it leverages modern CPUs and GPUs? And so the representation was designed with that in mind.
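The two Arrow goals described here — a canonical columnar in-memory layout, and cheap movement of data between contexts — can be illustrated with a stdlib-only toy. Python's `array` module stands in for Arrow's typed, contiguous buffers (the real format also carries a schema, validity bitmaps, and nested types):

```python
from array import array

# A "column" held as one contiguous, typed buffer in memory — the shape
# Arrow standardizes so that every engine agrees on the bytes.
prices = array("d", [9.99, 14.50, 3.25, 7.80])

# Processing is a linear walk over one buffer (cache-friendly and easy
# to vectorize), instead of hopping across scattered row objects.
total = sum(prices)

# Hand-off between "systems" becomes a buffer copy, not a value-by-value
# re-encode: ship the raw bytes of the column...
payload = prices.tobytes()

# ...and the receiver reinterprets them with no per-value parsing.
received = array("d")
received.frombytes(payload)
print(total, received.tolist())
```

In real Arrow the hand-off can even be zero-copy (shared memory, IPC, or the C data interface), which is what removes the serialization tax Jacques describes when a pipeline hops from Kafka to Spark to Python to a BI tool.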
And so the project is really about those two things: being able to move data in memory very quickly between different contexts, and having that representation also be very efficient for processing purposes. So that's the third big open-source project that we are heavily involved in and try to help make successful. It's a group of a lot of different people who contribute to it; we're just one of those, but we care deeply about those projects. And so Dremio as an open-source technology is also built on top of these three foundational open-source components. I've been active in Apache for many years now and appreciate the consensus we can build there, and I also like how attractive the license is for building a company that can support open source but also figure out how to monetize it. Right. And now these three projects are part of the Apache Software Foundation, right? That's right. Those three are Apache Software Foundation projects; our own product is Apache licensed, but it's something that we run ourselves. Right. From what I'm aware, Apache has very stringent policies for accepting a project. They have incubation stages, they have the Attic, and all those things. So are they in incubation, or fully graduated — what stage are these three projects at? Yeah, so these are all top-level projects. The Parquet and Calcite projects both went through incubation several years ago and became top-level, I think, three or four years ago, probably for both of those. The Arrow project was a little bit unique in that, because of the nature of the people who were coming together and what was already figured out, it actually skipped the incubation stage and went straight to being a top-level project.
Incubation is primarily about making sure that the community is going to be able to sustain itself, and also making sure that the developers involved understand what we call the Apache Way, which is basically a way of approaching consensus building and collaborative development. There are other open-source models, but Apache very much backs consensus. So incubation is about making sure that those things are understood and are part of the community. And with Arrow, I think there were maybe 15 or 20 people involved who were already Apache committers on other projects, and maybe five or 10 of those were already Apache members, which means that they've been with Apache for a very long time and have a special set of roles there. And so it made sense to just go straight to a top-level project for that one. Yeah, because I monitor Apache closely, and I have been doing open source for like 13 years. What I like about the Apache Software Foundation is that when they do take on a project, they have policies to ensure that the project will be sustainable. It's not like the developers will just suddenly disappear tomorrow. Right, it's really, really important, and I think the incubator does a good job of that. Because while many incubated projects actually turn into top-level Apache projects, there are some that do not, because it turns out that the people involved don't have enough time, or the project is really driven by only a single organization, or the organization realizes that it's not as open to consensus building as it needs to be. And so there are many reasons that it doesn't happen, but it's a good way of letting people wade around in the Apache space before they dive in head first.
Yeah, because one of the concerns with open-source projects remains sustainability — I interviewed Brian Behlendorf a few weeks ago, and one question was how to ensure that a project is sustainable, because just being open source doesn't mean it can't still die off. So yeah, Apache does a very good job there. In addition to that, since you're heavily involved with open source and the company itself is contributing a lot and open sourcing your product: I remember when I started my own journalism career, which was like 13 or 14 years ago, I completed my course and the first job I got was with Linux For You magazine. So from day one, I have been an open-source journalist. Back in those days, one of the challenges was that we had to go out and educate companies about why they should use open source and what the benefits are. But now almost everybody is using open source; the question is who is not using it, and why not. So the new challenge that we see is that these new users of open source don't actually understand how the open-source process works — why you should send your developers to events, or why they should contribute on office time. Since you have been involved for so long and you deal with a lot of customers and clients, do you see this pattern, where companies don't understand? Yeah, I think so. Let me say that I think at the start you're absolutely right: one of the good things that's happened with open source is that most enterprises now have an open-source-first strategy. And so any time they're doing a technical evaluation, they need to also look at what options are available in open source. And that level of corporate, top-down "hey, this is an important part of business strategy and should be considered" — I think that's very good. I think that businesses are very good at taking care of themselves, right? I mean, that's kind of the nature of a commercial entity at some level.
And so I think that people very much appreciate all the benefits of open source, right? Because open source allows you to protect yourself against lock-in, gives you the ability to extend the product if you need to, and the ability to just solve a problem yourself if you can't get someone else to solve it for you, because it's super important to your business. And so I think that a lot of people appreciate that part of open source when they're looking at it from a commercial organization's perspective. I think that people don't really understand how it works, though. And I think there are two parts to that, right? First of all, realistically, if you look at — well, I don't know if it's most, but a large number of — Apache projects that have been started in the last five years or so, probably 70 or 80% of the development time that goes into those projects is from one or more companies that are directly benefiting from the success of those open-source technologies, right? And so it's a form of paid open source, or sponsored open source. And so the idea that all open source is done by people in their free time in the garage — it's not true for a lot of these super successful projects that have gotten a lot of adoption. Not to say that that doesn't exist; there are definitely cases where it's been very successful. And so my perspective is that, in that context, it's not realistic that every organization is going to be able to invest developer time directly in adding new features to open-source projects. But I actually think that the easier win for most organizations is that they can contribute to the quality of open-source projects, right?
So making sure that you are reporting back the things that you see, collaborating to fix them, and probably helping to fix them sometimes — that, I think, should be the first goal for most enterprises using open-source software: if you're benefiting a lot from this, find the time to help with quality. I think that realistically, most organizations, unless they have a really important feature that they need, are not going to be able to contribute the amount of time that it takes to build something from scratch, but they can definitely improve the product. And that is through both providing bug reports, but also, where possible, allowing people a little bit of time to help fix things. And a funny situation that just happened very recently: we had a customer who found an issue with our product, and it turned out that it was actually an issue with an open-source library that we were building on. And so the customer actually went and patched the open-source library themselves, because the engineer was like, hey, I see what's going on here, and I think I can fix this. And so he fixed it, and then we were able to incorporate it faster and get it back to them. And so that's a nice thing when it happens. You can't expect that to happen very often, but it's really nice when it does. Right. I've talked to a lot of people, and the way some people put it is that if you are consuming open source, there are two ways you can contribute: either through currency or through code. And as you rightly mentioned, not a lot of companies have developer resources that they can actually invest. So either they sponsor the project, or they work with a vendor who is maintaining or contributing to the upstream.
So they are indirectly supporting that work through currency, because when you work with a vendor, that vendor — as in the example you gave, where the developer patched the library — takes your feedback and puts those changes upstream. So I really don't think that if you are touching open source in any way — unless you are just taking the whole code, forking it, and running your own fork, in which case you are creating a lot of technical debt and you will not survive that way — you can avoid helping open source in one way or another. There is no escape: you will be working with a vendor who will help. I think that's something people don't understand, but it happens. I think you're right. And I think the other thing that is interesting is that one of the things Apache really focuses on is community over code, which, for people who are not familiar with it, probably sounds a little bit strange when they first hear it. But it's the idea that the success of a project is driven more by the community that's working on the project — their collaboration, consensus, and drive towards good solutions — than it is by one piece of code. And in many cases there are PMC members and committers on projects who don't write code, right? They may provide documentation or packaging; they may just help out with a lot of testing. And so I think it's very important to appreciate all of those things. And that's actually something I'd call out to people who are using open source as well: even if you can't sit down and be a developer who's going to write the next feature or fix this bug, that doesn't mean you can't add value to open source and help those communities out.
And in many cases, those things are actually sorely missing in those communities: you've got developers who are happy to write code but are less comfortable writing documentation. And so if someone else could come in and say, hey, here are a couple of quick starts, or a tutorial, that can help out a project a lot. Right. So say there's a company using open source and working with a vendor, and the company itself is a small startup that doesn't have enough developers — they can't even send someone into the community. How can they realistically contribute even a bit? What are the options? Since you have been in the community for so long, what kinds of opportunities are there for them to participate, rather than them thinking, oh, why should we even bother with that, we can just pay somebody and be done with it? How can they get involved? Well, going back to what I was saying, it doesn't have to be a developer who's writing code, right? If you have a person who's using the product — use cases are a good example. One of the challenges that open-source people have is that they want to figure out ways to test their product. And so they come up with all sorts of unit tests and synthetic data — I work in data a lot, right? So you're always coming up with synthetic data sets or examples of potential problems. Customers can sometimes, depending on where they are, share their use case and share details of their use case. And that can help people — it can help the customer, because it means that the open-source project can incorporate that set of use cases into the test suites that are run every time someone does a release.
So that helps out the users, but it also helps the community, because the product can be better; the quality of the project can improve. And so that's one example. But there are many examples of, hey, come in and tell us what you're doing. Tell us what's good about the product, what's not good. Remember that many people are volunteering on the project. One of the key things at Apache is that, while there are many people who are paid to work on Apache projects, the relationship to the Apache project itself is a personal one, not a relationship between organizations. So it's that personal relationship that actually defines what they can do. And so while they may be being paid by someone, because that someone is very interested in adding certain features or functionality to an open-source project, the relationship is a personal one, and that person, at the end of the day, is the one who's actually making that commitment. So if you come into a community, appreciate those people for what they're doing, where they're giving their time, when they're spending their time, and what they care about, because open-source development, while in many cases paid, is still something you have to be passionate about in order to do a good job in the community. And so coming into the community and saying, hey, these are the things that I've learned, these are the things that don't work well — those are all really good things to do. Or saying, hey, here's some documentation; I just wrote up a short piece about our use case and how we use the product for it; here's a little video. All of those things are very helpful to open source and can be done by anybody; it doesn't have to be a developer. Right. And it doesn't cost anything.
You're just sharing your story. Yeah, that's a very good point. I was talking to somebody a few weeks ago at DockerCon, and when I asked him, his thing was: the biggest challenge for us, because we're a fully open-source project, is that we wish our customers would just tell their story. We don't want anything else. Just come out and tell people that you're using our product, how you're using it, what problems you faced, and how you solved them. Go ahead, download it from GitHub, we don't care — but at least tell us the story. That helps a lot with mind share and with getting developers interested. Yeah, that's an excellent point. Anything else you would like to touch upon? We talked about the product, we talked about the company, we talked about your engagement with the Apache Foundation. Anything else that you think we should have talked about, but did not? No, I mean, I would just invite people to come to those communities, right? Check out what we're doing on Arrow. Check out what we're doing in Calcite and Parquet. They're all really useful technologies. Come check out what we're doing on Dremio. Download the product, fork it, give us your feedback. We always like to hear what people are doing with the things that we're building. Awesome. I think that sums up this interview. Thanks once again for your time, and hopefully we'll see you again at the next show. Yes, absolutely. Thanks for your time as well. Good talking to you. And to our audience: thanks for watching and listening today. Please don't forget to subscribe to our podcast and YouTube channel. You can find the subscription links on tfir.io slash TV. See you next time. Have a nice day.