Welcome back to theCUBE, our continuing coverage of Snowflake Summit 22, day two, lots of content, as I've said, coming at you the last couple of days. Dave and I, Dave Vellante and Lisa Martin, are here with you. We have an exciting guest here next to talk with us about data discovery. Please welcome Shinji Kim, the founder and CEO at SelectStar. Welcome to the program. Thanks for having me. Great to see you. Excited to be here. Talk to us about SelectStar, what do you guys do, and then we're going to unpack data discovery. Yeah, why'd you start the company? Sure. So, SelectStar is a fully automated data discovery platform that helps any company find, understand and manage their data. I started this company because after I sold my last company, Concord Systems, to Akamai, I started working with a lot of global enterprise companies that manage a lot of IoT devices, like automakers or consumer electronics companies. And it became very clear to me that companies are not going to stop anytime soon collecting more data, more often, and trying to utilize it as much as they can. And cloud providers and all the new technologies like Snowflake have really helped them to achieve that goal. But the challenge that I started noticing at a lot of these enterprises is that they now have hundreds or thousands of data sets that they have to manage. And when you are trying to use that data, it's almost impossible to find which specific field, which specific data set you should use out of the thousands and hundreds of thousands of data sets you have. So that's why I felt like this is the next problem and challenge that I would like to solve. Also because I have a background of working as a software engineer, data scientist, and product manager in the stages of creating data, transforming data, and also querying data and trying to make business decisions on data. Having the right context about the data is so important for me to use that data.
So for us, we are trying to solve that challenge around finding and understanding data, and we call that data discovery. Wow, that's music to my ears, because I can't tell you how many meetings I've been in where somebody presents some data and someone says, okay, what's the source of that data? What are the assumptions used to derive that? I have different data. And then it becomes this waste of time. My data's better than your data, or everybody has an agenda. You cut through that. Yeah, so data discovery in a nutshell, we define it as finding, understanding and managing your data. So in SelectStar, we will automatically bring out all your schema information: where does the data exist? We will also analyze the SQL query logs as well as activity logs that are generated by any applications and BI tools that are connected on top of your data warehouse, so that anytime you're looking at a database, any particular database table, column or dashboard, we will tell you where this data came from. Where did it originate? How was it transformed, and in which reporting tables does it exist? Who's using this data the most inside the company? How are they using it? And which are the dashboards and reports that are built on top of this data set? So you don't have to go out and ask everybody else, hey, I'm looking for this type of data. Has anybody worked with this? This is actually something that I realized a lot of data analysts and data scientists waste their time on. So yeah, that's really what we call fully automated data context that we provide to our customers so that you can truly use all the data that you have in your data warehouse. And you do this by understanding the metadata, or is it some kind of scanning, or using math or code? First of all, we do connect and bring out all the metadata. So that's all the information under the information schema. And then we also look at all the query history.
So all your select SQL queries, all your create queries, create table queries, create view queries. And based on that, we will also match the metadata where it exists inside those queries and logs. And based on that, we will generate first and foremost what we would call column-level data lineage. Data lineage is all about showing you the flow of data: where it originated, how it was transformed, and where it exists now. And also what we call popularity. Who's using what data? How are they using it? And in aggregate, you can also find out which are the most important data sets in your company, and which are the data sets that can be deprecated because they were duplicates of other data sets and nobody's using them anymore. And we put a popularity score on every single data asset that you have in your company so you can see how it's being used. How do your customers take action on the information that you provide them? Do they ultimately automate it? Do they go through a process of sort of a human in the loop? Well, we do the automation for them. And we also provide them with a really easy-to-use user interface so that they can add any semantic-level data on top. So that's like tags, like whether you want to mark it as an analyst-approved table or a do-not-use table. Or if you want to put a PII classification on data, you can do that on a column. And we will automatically propagate those annotations throughout the platform. We will also automatically propagate any matching documentation that you might want to use within the data warehouse. And we will also provide you with more of a rich-text documentation that you can add on top as a business glossary or like a wiki, so that business users can get a better understanding of data concepts and models as well. How do they tag the data? Do they use another tool that does that? They can tag it within SelectStar. Any table or column has a little icon, a tag icon, so you can click on it.
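The popularity idea described above, matching known table names from the metadata against the query history, can be sketched roughly as follows. This is only an illustrative sketch, not SelectStar's implementation: the log entries, table names, and the `table_usage` helper are all hypothetical, and how the real popularity score is computed is not stated in the conversation.

```python
import re

# Hypothetical query-log entries: (user, SQL text). Real logs carry more fields.
QUERY_LOG = [
    ("ana", "SELECT id, email FROM analytics.customers"),
    ("ben", "SELECT * FROM analytics.orders o "
            "JOIN analytics.customers c ON o.customer_id = c.id"),
    ("ana", "CREATE TABLE analytics.orders_daily AS "
            "SELECT order_date, COUNT(*) FROM analytics.orders GROUP BY 1"),
]

# Table names as they might be read from the warehouse's information schema.
KNOWN_TABLES = [
    "analytics.customers",
    "analytics.orders",
    "analytics.orders_daily",
    "analytics.legacy_orders",
]


def table_usage(log, tables):
    """Return {table: (query_count, distinct_user_count)} by matching names in SQL text."""
    counts = {t: 0 for t in tables}
    users = {t: set() for t in tables}
    for user, sql in log:
        lowered = sql.lower()
        for t in tables:
            # Match the full qualified name only, so analytics.orders does not
            # also match inside analytics.orders_daily.
            pattern = r"(?<![\w.])" + re.escape(t.lower()) + r"(?![\w.])"
            if re.search(pattern, lowered):
                counts[t] += 1
                users[t].add(user)
    return {t: (counts[t], len(users[t])) for t in tables}


usage = table_usage(QUERY_LOG, KNOWN_TABLES)
# Tables that no query touches are candidates for deprecation review.
unused = [t for t, (queries, _) in usage.items() if queries == 0]
```

Plain text matching like this is a simplification; a production system would parse the SQL properly to resolve aliases, views, and column references.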
Or we also give you a view of every database page. We'll have all the tables in one place. You can add a keyword and bulk tag. So humans tag. Yeah, so humans tag. So in the beginning humans tag, and then we will automate the propagation of that tag. So if you already tagged, let's say, an SSN field as PII, then we will find all the other columns that may use the exact same data and tag them the same. Just as an example. Okay, so once the human puts it in there, then you automate the downstream. Because humans sometimes aren't great at classifying and tagging, there are inconsistencies, and I would think that you could use math to improve that. And we do have some plans to add a more automated tagging system. For example, we don't necessarily tag them, but we give our customers filters on top of their search results to see which are the data sets that nobody's using anymore, which are the data sets that have been created very recently, and you can also filter by who created them or who the owners are. So these are some of the aspects of the data, or even, like, when was the last time this data was updated? So these are the aspects of the operational metadata where we are starting to put more automated annotation. I would say more is coming up towards the end of the year. But in terms of semantic-level tagging, like, is this data set around customers? Is this data set for marketing, sales, customer support? That is something for which we are giving the data team a really easy-to-use interface to be able to easily organize them. How are you helping organizations? We think of all the privacy regulations and legislation. How is SelectStar a facilitator of data privacy for your clients? Is it part of that play? So I would say one of the main use cases of data discovery is data governance. So starting this company and starting to work with a lot of Fortune 500 companies as well as, I would say, more like recently IPO'd companies that have grown very fast in Silicon Valley.
Some of those customers have told us that they initially adopted SelectStar because they needed a good data catalog and search platform for their data team. But as they are starting to use SelectStar and starting to see all these insights about their own data warehouse, they are all kicking off their new data governance projects because they get to see a really good lay of the land of how the data is being accessed today. So this is why we have a very easy-to-use and also programmatic API so that you can add tags, ownership and set access control through SelectStar. We are actually just releasing a beta version of what we call policy-based access control, where you can use either role-based or attribute-based access control so that different roles of users get to see different versions of SelectStar when they log in. And this is just the beginning. With PII, for example, for any column that's already marked as PII, we will always strip out the value before it gets fully processed within SelectStar. So even if anybody might stumble upon any SQL queries that other analysts have run, those values won't be available in SelectStar at all. And you started the company right before the lockdown, right, or right ahead, that must have been crazy. March 2020 is when I incorporated SelectStar. It was a very interesting time to start the company. And in a way, I'm glad I did. We had a lot of focused time to really go heads down, build out the product, and work closely with our customers. And it's really awesome to get to, you know, provide that support to more customers today, yeah. And so what are you doing with Snowflake? So Snowflake has been a great partner for us.
A lot of customers, and Snowflake is really great for this, are basically building a single source of truth for their data by connecting all their source databases, as well as their ERP, CRM systems, ad systems, marketing systems, and SaaS platforms. You can now connect them all to Snowflake, and they will all dump their data inside. So that allows data teams to actually join and cross-match the customer data across so many different applications. And what we see from a lot of Snowflake customers is that they end up with many different schemas and tens of thousands of tables. And now they need a better data discovery tool so that they can use and leverage the Snowflake data that they have. So in that regard, we are a Snowflake data governance accelerator partner. And as part of that accelerator program, one of the things that we've integrated with Snowflake is what we call Snowflake Tag Sync. So if you create any tags in SelectStar and you mark something as PII, we will also replicate the same tag to Snowflake. And so everything is synced in there. And on top of that, a lot of our customers really like using our column-level lineage because we will show how all the data tables within Snowflake are connected to one another. And last but not least, we actually just released a feature today called the auto-generated ER diagram. ER diagram stands for entity relationship diagram. An ERD is like a blueprint of your data model. When your engineers and data architects start creating tables in databases, this is a diagram that they will put together to show how they are translating business logic into data models in the databases. And that includes which are the fields for primary keys and foreign keys, and, like when you look at a star schema, how different tables are joined together.
When all these tables get migrated into Snowflake, a lot of them actually lose the relationships of primary keys and foreign keys. So what we found is that many analysts are starting to guess how to join different tables, how to use different data sets together. But because we know how other analysts have actually joined and used the tables in the past, we can give them guidance and a really nice diagram that they can refer to. So that is the ERD that we are releasing today, available for all customers, including our free customers, where you can select any table and we will show you the relationships that table has, which you can use right away in your SQL queries. And that will facilitate, that simplifies doing more complex joins, yes? Which is an Achilles heel of Snowflake. That's not really what they are about, but they have to rely on the ecosystem to help them do that, which has always been their strategy. The company was founded in March 2020, amazing, and it's relatively small still, yes? Or is it self-funded? I mean, you raised a little bit of money, but what's your status there? Yeah, we raised our seed funding when I first started the company. We also raised a bridge round last year, and we plan to raise another venture round of funding soon. Great. And we're going to be making those investments. What are some of the key parts of the business that you're going to use that funding for? There's a lot to build. Yeah, engineering. Obviously more automation features, but I would say right now we have built a really good foundation of data discovery. And that includes fully automated data cataloging for metadata, column-level lineage, and also building the usage model, like popularity, who's using what, all that type of stuff. So now we are starting to build really exciting features that leverage these fundamental aspects of data discovery, like auto-propagation of tags. We also do auto-propagation of documentation.
So you write a column description once and it will get replicated and changed everywhere throughout your data model. Other things that we have in store, especially for next year, are package support for specific use cases like data governance, self-service analytics, and cloud cost management. Nice, lots of work. Impressive, I'm blown away. And you've accomplished this during a pandemic, that's even more impressive. Thank you so much, Shinji, for coming on and talking to us about SelectStar, and about what you're enabling organizations to do: really derive the context from that data, taking a lot of manual work away. We appreciate your insights and your time and wish you the best of luck. Thanks so much for having me here. This has been great. Good, thanks so much. For Dave Vellante, I'm Lisa Martin. You're watching theCUBE's coverage of Snowflake Summit 22, day two. Stick around, Dave has an industry analyst panel coming up next. You won't want to miss it.
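The tag propagation discussed in the conversation (a human tags one column as PII, and the tag is then carried to the other columns that use the same data) could be sketched as a walk over column-level lineage. This is a minimal sketch under stated assumptions: the lineage map, column names, and the `propagate_tag` helper are hypothetical, and using downstream lineage as the propagation rule is an assumption made for illustration, not a description of how SelectStar actually does it.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the columns derived from it.
LINEAGE = {
    "raw.users.ssn": ["staging.users.ssn"],
    "staging.users.ssn": ["reporting.customer_360.ssn", "reporting.kyc_checks.ssn"],
    "raw.users.email": ["staging.users.email"],
}


def propagate_tag(lineage, tags, start, tag):
    """Breadth-first walk downstream from `start`, adding `tag` to every column reached.

    Returns a new tag mapping; the input `tags` mapping is left unchanged.
    """
    updated = {col: set(existing) for col, existing in tags.items()}
    queue = deque([start])
    seen = {start}
    while queue:
        col = queue.popleft()
        updated.setdefault(col, set()).add(tag)
        for child in lineage.get(col, []):
            if child not in seen:  # guard against cycles and repeated visits
                seen.add(child)
                queue.append(child)
    return updated


# A human tags the source SSN column once; downstream copies inherit the tag.
tags = propagate_tag(LINEAGE, {"raw.users.email": {"contact"}}, "raw.users.ssn", "PII")
```

One design caveat: blindly following lineage can over-tag, since a downstream column may be hashed or aggregated and no longer sensitive, so a real system would likely need a rule for when propagation should stop.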