It's theCUBE, covering HPE Big Data Conference 2016. Now, here are your hosts, Dave Vellante and Paul Gillin. Welcome back to Boston, everybody. Brendan Stennett is here, the co-founder and CTO of ThinkData Works, and we're going to talk about open data. Brendan, welcome to theCUBE. Thank you, thanks for having me. So, in from Toronto, I said that right, my kid corrected me the other day, it's not Toronto, Dad, it's Toronto, but so, welcome to Boston. Tell us about ThinkData Works. Why did you start the company, what are you guys all about? Yeah, so we basically wanted to create something to look and peer into the world of open data. In open data we saw something that's very valuable. It was still kind of a new trend when we started this thing. And what we found was that if you actually wanted to find open data and use it, you had to go into all these different government portals. We're talking about every level of government, from the feds to state governments, to county and regional sources, municipal sources, everything. They all have different data under their jurisdiction and they're all kind of doing it their own way, releasing in different file formats, some proprietary, some open, whatever. It's a mess. What we wanted to do is try and help solve that messiness of open data: provide one window to search into all the open data that's being published and made available. We also actually ingest this data and do some standardization on it. We'll clean up things like dates so they're all in the same sort of format, currency values, even different locality conventions. We have Quebec in our country, which is obviously French. They'll represent the thousands separator, for example, differently than the English-speaking world would, and that sort of thing makes dealing with open data across different jurisdictions difficult. So that's the sort of thing that we like to standardize and bring in.
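The kind of standardization described above can be sketched in a few lines. This is a minimal illustration, not ThinkData's actual pipeline: the date format list and the locale handling (English "1,234.56" vs French-Canadian "1 234,56") are assumptions for the example.

```python
# Illustrative normalization helpers: unify date formats to ISO 8601
# and parse numbers written with different locale separators.
from datetime import datetime

# A few common source formats; a real pipeline would carry many more.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]

def normalize_date(raw: str) -> str:
    """Try each known format and emit an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_number(raw: str, locale_style: str = "en") -> float:
    """Handle English (1,234.56) and French (1 234,56) separators."""
    s = raw.strip()
    if locale_style == "fr":
        # French style: space (or no-break space) groups thousands,
        # comma is the decimal mark.
        s = s.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    else:
        s = s.replace(",", "")
    return float(s)
```

Once values are in one canonical shape, data sets from different jurisdictions can be compared and joined directly.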
From there, we provide the ability to search for it. And if you want to integrate open data into your process, we provide an API where you can tie that directly in, or just a full export of the data to bring it behind the scenes and integrate it with your existing private data. So you started the company just to make all this easier for people? Yeah, exactly. We wanted to do something with open data. We weren't really sure what we were going to do with it. We were thinking, well, maybe we can take open data and apply it to real estate or finance or any of these different industries. And we said, well, we don't know anything about any of these industries, so how are we going to take open data and try and make them better? But what we did see was a common access problem in getting to the data itself, and no matter what we did in any of these industries, we'd have to solve that access problem. So we said, well, look, we think we can solve that. Let's start a company doing that. What's the most sought-after open data set? Is it weather? I'm sure you get that question a lot, but is it Google Maps data? What is in demand? Yeah, it's honestly everything. We see a lot of traction in finance, so there's a lot of stuff there, and there are a lot of really cool things. There are parking meter locations, for example. This is the one thing you don't really think about, but usually when you get directions, even with Google Maps or whatever it is, a parking meter is where you really want to go, not the front door of the destination. You'd be dropped off at the door, in the middle of traffic, and say, okay, I'm just going to stop here. You want to take the closest parking meter, the closest parking spot. And hopefully someday maybe know which parking meters are actually open.
Some of these are smart meters where they actually know the vacancy of the spot, and we might be able to know if it's actually an open spot and guide you right to it. All right. Where do you find this data? There must be new sources emerging all the time. Yeah, with open data, it all comes from government open data portals. So that's every level of government. We've created techniques to search for these different portals, look for certain commonalities, look through lists of different counties, and automatically scrape and try and find where these portals are, then tie into them and ingest them into our process. And we listen. We have Google Alerts set up, so when new things pop up every day, we've got team members monitoring Twitter all the time for new open data portals coming online. We get a lot of them basically the day of release, on the platform ready to go. Sometimes people reach out to us too, as they're releasing their portals. What are some of the more interesting formatting problems you've had to solve? Everything. It'll be things like somebody releasing an XLS file, an Excel spreadsheet, and they've probably spent five years of their life working on this spreadsheet. They've got all these macros integrated with it, and drop-downs, and if you just want to use the Excel spreadsheet, it's probably going to get the job done. But if you want to bring that into the rest of your database and try and integrate it into your process, well, good luck. So those things are always very annoying and very interesting to work through. But a lot of the other things we've had to solve deal with all the different proprietary formats, like Shapefiles for example, or GeoJSON or KML or GML, or any of the different CSV formats. It's basically everything, including encoding issues.
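Of the formats listed above, GeoJSON is one of the simpler cases, since it's plain JSON. A rough sketch of flattening a GeoJSON FeatureCollection into tabular rows, with an invented parking-meter feature as sample input (Shapefile, KML, and GML would each need their own parser):

```python
# Flatten a GeoJSON FeatureCollection into a list of flat dicts,
# pulling point coordinates up alongside the feature properties.
import json

def geojson_to_rows(text: str) -> list:
    doc = json.loads(text)
    rows = []
    for feature in doc.get("features", []):
        row = dict(feature.get("properties", {}))
        geom = feature.get("geometry") or {}
        if geom.get("type") == "Point":
            row["longitude"], row["latitude"] = geom["coordinates"][:2]
        rows.append(row)
    return rows

# Hypothetical sample input for illustration.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"meter_id": "M-101", "rate": 2.5},
        "geometry": {"type": "Point", "coordinates": [-79.38, 43.65]},
    }],
})
```

Rows in this flat form can then go through the same standardization steps as any CSV source.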
And is it correct to say that you've essentially developed a search engine as the interface? Yeah, for open data, exactly. You call it NAMARA? NAMARA.io? Give us some background. Yeah, NAMARA.io, that's exactly it. So once we've indexed and linked to all these different open data portals, we now provide a window to search into them. That's available for free. We don't charge for that capability; we actually don't charge for any access to open data, up to a certain number of API calls. So a lot of that's just, here you go, use it if you want to use it. From there we'll take other steps and try and make open data even more digestible for large companies that are trying to use it. Take the parking meter example I just used. If you wanted to take parking meters from 300 of the top cities in the United States, for example, you're still having to deal with 300 different data sets. Well, if you're using us, at least they're in the same format, but you've got different problems, like different columns representing different fields, right? Some of them might have the hours they're operated spread across three different columns, and some might just have it as one. Some might have the actual rate on them. It's all this stuff that happens that we want to try and solve as well, so what we're starting to do now is come up with standard attributes for this data. If you're trying to represent parking meters, this is how parking meters should be represented. And we were talking offline, and you said you don't go after real time, but if I understand it, you would if in fact it was an open data set. So my question is, with the whole internet of things explosion coming on, I would think many of those parking meters, parking lots, buildings, et cetera are going to have open data sets that you can tap into. Is that right? Yeah, absolutely, yeah.
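The standard-attributes idea above boils down to a per-source column mapping. A minimal sketch, where the per-city column names and the standard field set are invented for illustration:

```python
# Map heterogeneous source columns onto one standard attribute set.
# City names, source column names, and standard fields are hypothetical.
CITY_MAPPINGS = {
    "boston": {"meter_id": "METER_NO", "hours": "OP_HOURS", "rate": "RATE_HR"},
    "toronto": {"meter_id": "id", "hours": "hours_of_operation", "rate": "price"},
}

def to_standard(city: str, record: dict) -> dict:
    """Rename a source record's columns to the standard attributes."""
    mapping = CITY_MAPPINGS[city]
    return {field: record.get(src) for field, src in mapping.items()}

def combine_hours(record: dict,
                  cols=("hours_mon_fri", "hours_sat", "hours_sun")) -> str:
    """Some sources split operating hours across several columns; join them."""
    parts = [record[c] for c in cols if record.get(c)]
    return "; ".join(parts)
```

With one mapping per city, 300 parking-meter data sets collapse into a single table with a shared schema.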
We don't ingest any of the real-time data, because we're kind of more of a static snapshot of things as they get updated, but we do map to it. So you've got things like GTFS, the Google transit feed specification; we'll map to all the feeds that are available there if someone is actually trying to integrate with them. There are other things like NextBus APIs, where you'd be able to see when the next bus is coming based on the GPS tracking location. So we'll tie to all these. We don't have as high a level of integration with them as we do with some of our other data sets, but we'll map to them. About three years ago President Obama signed the Open Government Initiative, and the idea was to standardize data formats across the federal government. Have you seen any changes as a result of that? Is that driving any change in the US, or trickling down to other municipalities? Yeah, I think the trend generally is going open, for the most part. In Ontario we've actually just released something called Open by Default, which is a policy basically saying that unless the data is going to impact somebody's privacy or national security, it's going to be made available and it's going to be put open. So I think the trend is more open than closed, absolutely. And we were talking about a thousand open data sources that you know of right now, is that right? Yeah, we've mapped to a thousand different sources of open data, which translates to about 75,000 to 100,000 different data sets being made available out there, and that's not worldwide. Unfortunately we only have US and Canada right now. Okay, so that's North America. It's North America, yeah. It's a thousand. Okay, yeah. So there's going to be much more once we go beyond that. We haven't; that's the next step. Presumably at least double that, I would think. Oh, absolutely, yeah. Triple, even. And it's everywhere. Mexico's got a huge open data movement as well.
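To give a flavor of what a GTFS-style feed enables, here is a toy sketch of finding the next scheduled departure at a stop from stop_times-like rows. Real GTFS has many more fields (trips, service calendars); the CSV sample here is invented.

```python
# Toy GTFS-style lookup: next scheduled departure at a stop.
import csv
import io

# Hypothetical stop_times-style data for illustration.
STOP_TIMES_CSV = """stop_id,departure_time
S1,08:15:00
S1,08:45:00
S2,09:00:00
"""

def next_departure(stop_id: str, now: str):
    """Return the first departure at or after `now` (HH:MM:SS), or None."""
    reader = csv.DictReader(io.StringIO(STOP_TIMES_CSV))
    times = sorted(r["departure_time"] for r in reader if r["stop_id"] == stop_id)
    for t in times:
        if t >= now:  # zero-padded HH:MM:SS strings sort chronologically
            return t
    return None
```

Schedule data like this is static and fits the snapshot model; the live GPS position of the bus is the real-time part that gets mapped to rather than ingested.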
Other countries too; South America's got things going on now. Yeah, it's everywhere. Tell us more about the company. So how do you make money? You're a for-profit organization. What do you sell? Yeah, so we sell the same data, basically. We give some of it away for free to some degree, but we try and package it even better, like that example I just gave of pulling out common attributes. That's where a lot of our effort goes in with our data science team: actually looking at all these data sets and how we can represent them as one, how we can get rid of a lot of these differences in the data, then package that together and sell it. This is for people that are trying to digest hundreds or thousands of the same data set across different municipalities, that actually have a business case for it and can justify spending the money on it. And you'll sell that as a service, an ongoing service? Exactly, yeah. Okay, and how many are you? Talk about funding. How did you get this thing off the ground? Yeah, we raised a seed round about 18 months ago, so we're still going on that. We're doing fairly well for ourselves right now. We're still a small team; we're only 12 people. Great. And yeah, Toronto's a little cheaper for a startup, so we can make that money go a lot further. Depending on the exchange rate, it has been expensive at times. Well, but developers are cheaper. We're not paying the same price for developers. We're not paying the same price for rent. The developers themselves aren't paying the same price for their personal rent, so they can do it for cheaper than in the valley. We don't have to worry about healthcare benefits. We've got all the other benefits, but healthcare we don't have to worry about, because that's already included in Canada. These are big things, right? There sure are.
I think for starting a startup in Canada, especially Toronto, it's sort of the best time right now to actually do it. Well, there's a strong software base up there. I mean, IBM has a huge presence up there. Absolutely, yeah. You must see some good DNA. You must see some interesting applications built using your API. What are one or two of your favorites? Yeah, to be honest with you, a lot of what we're working with is large companies. We're talking about banks, other companies that I can't really talk about. And some of the applications they're building with it are absolutely phenomenal. We only get to peer into some of them. Our customers really are the people that are paying for the data, and they're building the stuff that is really interesting. Unfortunately, a lot of that I can't talk about. You can't even describe it in general? I mean, we're not asking you to name names. Yeah, yeah, exactly. There are some companies doing really interesting things with linking government purchasing data together. So, seeing trends in purchasing and how money's flowing through government. A lot of people forget that the federal government's one of the biggest buyers in any single market. So if you can see how they're buying things, you can relate that back to: is this company strong? Is there high risk? Can we extend a loan to them? Information becomes valuable if you actually use it the right way. So how do you use Vertica? Where does that fit? Talk about the problem it solves that you maybe couldn't solve without it. Yeah, so we back all of our data on Vertica. So we warehouse all this open data so we can make it available with rich API access, without having to just export a flat file and integrate it that way. Vertica's great at what it does. So it is your database, is that right? Well, for the data warehouse component, yeah. Okay, yeah, not where the transactions for sales take place.
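Consuming warehoused data through an API, as described above, usually looks something like the sketch below. To be clear, the endpoint URL, dataset ID, and parameter names here are hypothetical placeholders, not Namara's documented API.

```python
# Hedged sketch of pulling rows from an open-data API into a pipeline.
# Endpoint, parameters, and dataset ID are invented for illustration.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://api.example.com/v0/data_sets"  # hypothetical endpoint

def build_query_url(dataset_id: str, api_key: str, limit: int = 100) -> str:
    """Assemble the request URL for one page of a data set."""
    params = urlencode({"api_key": api_key, "limit": limit})
    return f"{BASE}/{dataset_id}/data?{params}"

def fetch_rows(dataset_id: str, api_key: str) -> list:
    """Fetch one page of rows; a real client would paginate and retry."""
    with urlopen(build_query_url(dataset_id, api_key)) as resp:
        return json.load(resp)
```

The returned rows can then be joined against private data downstream, which is the integration pattern the interview keeps coming back to.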
But I mean, could you have done this business without a Vertica-like platform? Well, we were on other solutions before Vertica. Oh, you were, okay. And we kind of just stumbled our way through them, and now it's come to Vertica. So yes, I guess. Yeah, we could have, but not as well, I suppose. What's different? Yeah, take us back to the before and after. What changed when you brought in Vertica? Yeah, what we were just starting to see was load time performance, to be honest with you. As we're getting more and more data in our system, all this stuff's updating all the time, and we were starting to really suffer on load time, where we'd constantly have our queue backed up. We can expand and contract most of our pipeline, but the database is a little bit more static. And you were using a traditional RDBMS before Vertica, or was it another MPP database? Yeah, we were using another MPP one before. It didn't have all the capabilities that we wanted. We moved to actually just a pure index at one point. That worked great; it was just a little bit more expensive than we would have liked. Load times also suffered. So it was something that you built on your own? No, we were actually using Elasticsearch at one point. Oh, okay. So how was it? Oh yeah, honestly it was phenomenal at query performance, blazing speed at actually delivering that, but it's an index, where the data size on disk is just so much bigger, and that actually does become an issue at that point. Load time was also a lot slower. So these are the sort of things we couldn't address with the same technology, so we had to move to the next step. I always kind of wanted to use a column store for this particular problem, and we kind of just found the right one. Because why? Add some color to that statement, just, you know, translate from a technical mind. Right, right.
Well, the data doesn't change in individual chunks. It's not like I update my email address on a website and then that cell changes in the database, right? The whole thing is going to change at once. So it very much is write once, read many times. When you're in a situation like that, a column store has a lot of advantages over a row store, which is meant for rapidly changing and updating data. This is meant for reading data: analytics, high-speed performance, high concurrency, et cetera. How do you deal with changes to source data? We dump and replace, basically. It's the quickest and easiest way to do it. We could start doing diffs on it, but it's just faster to dump and replace. I tell you, I'm looking at NAMARA.io right now. I'm already hooked. The Houston Dangerous Dog Registry. Oh yes. One of the many phenomenal databases that you have here. This is really a great resource. Well, when I went in there too, when I started, you have to sort of train your mind to think about what you're actually looking for, like you have to do with Google as well. This is actually one of the big problems we've found with people using our platform: without being able to kind of browse the data, you didn't really know what you were looking for. So people just press enter, and then maybe you get some arbitrary data sets that might have been recently updated and are kind of close to the top, but you don't really know what's out there. Right. So we're actually in the middle of a complete relaunch of our platform. It's like two weeks out, so we're really excited. We wanted to get it ready for this, but we didn't want to push anything that wasn't ready. It's way more centered around that whole problem, which is browsing data and actually trying to find what you're looking for without having to type a search query.
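The "dump and replace" refresh described above is often done as a stage-and-swap, so readers never see a half-loaded table. A minimal sketch using SQLite as a stand-in for the warehouse (Vertica would use its own COPY and swap machinery, and the two-column schema here is invented):

```python
# Stage-and-swap refresh: load the new snapshot into a staging table,
# then atomically take the place of the old one.
import sqlite3

def refresh(conn: sqlite3.Connection, table: str, rows: list) -> None:
    cur = conn.cursor()
    cur.execute(f"DROP TABLE IF EXISTS {table}_staging")
    cur.execute(f"CREATE TABLE {table}_staging (meter_id TEXT, rate REAL)")
    cur.executemany(f"INSERT INTO {table}_staging VALUES (?, ?)", rows)
    # Swap: drop the old snapshot and promote the staging table.
    cur.execute(f"DROP TABLE IF EXISTS {table}")
    cur.execute(f"ALTER TABLE {table}_staging RENAME TO {table}")
    conn.commit()
```

Because the whole data set changes at once, this write-once, read-many pattern is exactly where a column store's bulk-load and scan performance pays off.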
Yeah, kind of give the user some visibility into what's in there and spark some ideas that they can then go explore. Some sample searches. Cool, yeah. We've been working really hard with the new census data, actually, so that's going to be all in there, and I'm really excited to see what people start to use that for. We've got US procurement data, we've got import and export data, so that's data on goods in shipping containers as they're coming in and out. Extremely valuable in basically any industry. Are you keeping just current data, or archives of historical data? Anything that's made available, basically. If there's historical data made available, we'll keep it. If it's just the current data, then just the current, yeah. You're making the world's open data available through an engine that is going to expand beyond North America and scale; I mean, architecturally you can scale virtually infinitely. Is that fair? Yeah, with the right technology, exactly. With things like Vertica, our database might have been the problem before; now it's not going to be the problem, and the technology keeps advancing to make that happen even faster. Awesome, all right, Brendan, we'll give you the last word, maybe your take on Vertica or HPE's big data conference, things you're hoping to learn. Honestly, the conference is great. I don't know if you caught the speakers this morning. Yeah, we did. In fact, we didn't mention that: in the keynote there was, in addition to Colin, Phil Black, the Navy SEAL. Yeah. It didn't make me want to become a Navy SEAL, but I'm awed by people who do. Yeah, and Steve Spear, who we're going to have on. Right, I mean, you read books about what they go through in Hell Week and you say, I never would have made it. I would have been ringing that bell in the first 15 minutes. First 15 minutes, yeah, exactly. I don't think I'd have made it far either. So, sorry to interrupt.
But no, it's been great so far. A lot of friendly people; everyone kind of wants to talk and share solutions to the similar problems you might be having, and it's a chance to actually talk with people that are facing those same problems. Great, well, congratulations on getting the company off the ground and getting some initial funding, and good luck. Great, yeah, thanks for having me. All right, keep it right there, everybody, we'll be back. This is theCUBE, we're live from Boston. Back after this word.