The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Welcome, everyone, to another Vaccination Database Tech Talk. I'm excited today to have Vishakha Gupta. Vishakha is the co-founder and CEO of ApertureData, which is the main company backing ApertureDB. I've known Vishakha for several years. I first met her when she was a researcher at Intel Labs working in the database space. Before joining Intel Labs, she did her PhD at Georgia Tech, and she's also a CMU alum. So Vishakha, thank you so much for being here. For the audience, if there are any questions for Vishakha as she's speaking, just unmute yourself and ask your question at any time. We want this to be a conversation and not just her talking by herself for an hour. So with that, Vishakha, thank you so much for being here. Thanks so much, Andy. It's great to be back at CMU, even if it's virtually. So thanks, Andy, for arranging this, and thank you all for being online today. Let's just get started. So images and videos, or visual data, are special. And I know you'll say, well, of course, your company is based on that, so you're going to say that. It is special because it's rich in information. Data science and machine learning techniques in the last decade have really helped companies from various domains, like smart retail, smart agriculture, and medical imaging, understand visual content and enable better customer experiences, which gives them a competitive edge if they can use it right. Naturally, this has led to a rapid increase in the amount of visual data that is now accessed a lot more frequently and needs to be managed. So a valid follow-up is: do we know how to deal with it? Well, images and videos can be individually large. We've come across medical imaging use cases where one image is 2 gigabytes. And typically they also occur in large volumes: you have millions or sometimes billions of images and videos to deal with. Another unique aspect of this type of data is that the transforms on visual data are also unique. If we're thinking databases, they typically support operations like sum or average, which lose meaning when you're talking about visual data. The equivalent and common operations would be preprocessing tasks like resize, crop, rotate, and maybe sampling videos, performed on images, frames, or the videos themselves. As common as these are across applications, they can also be quite computationally expensive. So the complexity of the operations, or transformations, we're talking about is unique in this case. Visual data is also typically accompanied by a lot of metadata, because the metadata doesn't just contain information about the visual object itself, like the file name or the size of the image; it also describes more about the surrounding application: where did that data come from, what are the annotations in it, what are the features describing the content, and so on. So to see if we have a handle on this, let's start with an example. Imagine you're a data scientist or a machine learning engineer who has to train a model to detect thousands of new classes of objects, and train with hundreds of thousands of new images that you've just received in some cloud bucket. I want to walk you through a quick example.
Just think about someone who is trying to understand the data and use it. What are the steps they have to go through before they can actually do that job? If you're working in a big company, for example, in such cases you'll first have to find which is the right database and table to use from which to query your metadata. Then you go get permission to access that database and table. Then you have to make sure that you allocate a large enough virtual machine for yourself to be able to download all this information and run your training. Let's say the data was actually stored in a Postgres table. So this is just a sample query; it's not exactly matching any particular use case, but it's something we wrote for another paper from which I'll use the numbers. You first have to perform all the joins, access all the tables, and get URLs to the images that you're going to use to train your models. That's step one. After all the setup, you get the metadata. Now you finally have URLs, because the metadata pointed to the URLs in some cloud bucket, and now you have to go get permission to download them. So you've got the metadata, you've downloaded the images. Now let's say your model expects 224 by 224 pixel images. So you have to resize the original images, which could be any size, 1K or 4K resolutions. Now you have to bring in some OpenCV-type library and implement your resize and rotate functions. And only then are you finally ready to train or classify or whatever your original task was. Now let's see what that can look like with ApertureDB. You literally get access to the Python client, connect to the server, then put in your constraints. In this case I have a very simple constraint, but you could put in constraints like all images with a horse in them, or all images with dogs in them. In this case, I'm just asking for all images with a particular license. You create a dataset, let's say you're using PyTorch for training. And that's it. You query it, you have 120,000 images. You can just check what's in a particular image and run a classifier, and it quickly tells you that it shows a ballplayer, a baseball player, in this case. So that's how easy it was to create that training or classification pipeline with ApertureDB. That's the transition we wanted to bring about. We encountered the data management status quo for the first time when Intel launched their Science and Technology Center around the Visual Cloud. My co-founder, Luis, and I were at Intel back then, and we started working with machine learning researchers from Carnegie Mellon and Stanford, building frameworks to detect and understand content in videos that they were collecting. We were honestly very excited to get a chance to learn from the latest and greatest in computer vision and deep learning. But what we ended up spending most of our time on was managing these videos and the application information, because there just weren't existing tools or databases that we could download and use for this complex unstructured data. The more we looked around, the more we noticed data infrastructure being hard on a lot of other teams of data scientists and machine learning engineers trying to use images and videos. We could also see that this data was growing rapidly, with an explosion coming in the quantities in so many different domains. So it made sense for us to solve this problem, and solve it well, for people beyond just our team at the time.
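To make that notebook flow a little more concrete, here is a minimal sketch of the ApertureDB side of the workflow using the Python client. It is an illustration only: the connection parameters, the license constraint value, and the small PyTorch Dataset wrapper are assumptions for the example, not the exact code shown on the slide.

```python
# Minimal sketch of the workflow described above. Connection details, the
# constraint value, and the Dataset wrapper are illustrative assumptions.
from aperturedb import Connector
import torch
import torchvision.io as tvio
from torch.utils.data import Dataset

db = Connector.Connector("aperturedb.example.com", user="user", password="pw")

# One JSON command: filter images by a metadata constraint and return the
# image data along with a couple of properties, in a single round trip.
query = [{
    "FindImage": {
        "constraints": {"license": ["==", 3]},     # e.g. a particular license
        "results": {"list": ["id", "license"]},
        "blobs": True
    }
}]
response, blobs = db.query(query)

class ApertureImages(Dataset):
    """Thin PyTorch wrapper over the image blobs returned by ApertureDB."""
    def __init__(self, blobs):
        self.blobs = blobs
    def __len__(self):
        return len(self.blobs)
    def __getitem__(self, idx):
        # Decode the JPEG bytes returned by the server into a CHW uint8 tensor.
        raw = torch.frombuffer(bytearray(self.blobs[idx]), dtype=torch.uint8)
        return tvio.decode_image(raw)

dataset = ApertureImages(blobs)   # hand this to a DataLoader and a classifier
```

In the talk, the batching and prefetching is handled by ApertureDB's own dataset loaders rather than a hand-written wrapper like the one above; the sketch is only meant to show how little code sits between the constraint and the training loop.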
That's basically what led us to spin out ApertureData. For today's talk, since computer vision and machine learning workloads and use cases are relatively new, after motivating the problem and what ApertureDB can do, I'll briefly describe what our users and their use cases look like. We'll then spend most of the time on the design decisions we made as we built this database and how they tie to the performance we see, and close with our next steps. So, coming back to why we believe data management platforms are not designed for image- or video-based machine learning analytics. Ever since leaving Intel to do this company, we've had hundreds of conversations with data scientists, senior data scientists, and machine learning engineers across different companies in different domains. What we have learned is how complex the landscape is for dealing with images and videos in these applications. Dealing with visual data types means a lot more than just image and video files. It means multiple data types like application metadata and annotations; sometimes there are n-dimensional embeddings. Not only do you have to deal with different types of data, but your current option is to deal with these data types in different products that each address one of those types. Of course, the lack of interoperability among them means you have to translate, which hurts performance, and worse, it makes data lifecycle management quite painful because now you have to ensure the data you insert is consistent across these various products. Beyond managing the data, pulling the right data out of various systems leads to long engineering delays and requires conversions and processing to feed into your machine learning application layers. It can also make data science teams reluctant to keep refreshing their datasets or changing metadata schemas, for fear of taking more time away from actually understanding the data and being productive. The last point I want to drive home here is that creating point solutions on an as-needed basis, which is what most people end up doing today, also means a lack of reuse, which further prolongs delivery of their business objectives. And in all of this, we haven't even talked about the orchestration components to instantiate all of it, or the components to authenticate and control access to the different sets of data. Now, I want to use a classic development sequence to really paint the picture of how these, what we call DIY or do-it-yourself systems, evolve and become extremely messy over time as you try to transform your input data into useful outputs: let's say you're trying to suggest some recommendations, or you're training models to deploy, or just displaying data to users. You start by receiving different types of data: you have images and metadata, you might have videos, you might have annotations. Naturally, that goes into storage buckets and some choice of databases, and the choices depend on who is implementing the solution at the time. Very often, you need labeled data for training models. That means you now bring in manual or automatic labelers that need access to these datasets. And then you have to take the annotations they provide, in whatever format they are in, and find a way to land them. Sometimes they go in cloud buckets, where they are not searchable, and sometimes people transform them and store them in a database so they can be used later.
After all this is when the DIY scripts come into the picture, which bring the data together and return a dataset to a machine learning framework or a front end for display. Sometimes any of these steps can require preprocessing or augmenting the original data for various reasons: it might be because your neural network expects it, or you might need thumbnails just because they're easier to display. And then finally, in some cases, if you're building recommendations or you need to find similar-looking objects, not using keywords but actually using content, which is fairly common, you have to address the indexing of high-dimensional visual feature vectors and then support k-nearest neighbor and similarity searches. So what that ends up looking like is a complex, glued-together system that is brittle and painful to maintain and reuse, not really an elegant or efficient way to solve this. And imagine what this looks like when you add multiple use cases and users and you keep scaling your data. What people have been wanting is a single unified approach for dealing with all these types of data. If you could have a unified, holistic, purpose-built system to do all of that, it would mean enhanced productivity with simplified data engineering and much faster iteration on machine learning, rather than spending a lot of time on data infrastructure. A system like that would also be able to evolve rapidly as machine learning evolves, because new machine learning methods mean you can extract more information, which means you've got to go update your original data with richer metadata. And such a system would also be able to scale as rapidly as the data grows, all of this without disturbing any of the user pipelines. That's what we offer with ApertureDB. It's a purpose-built visual data management system for analytics. ApertureDB natively supports management of images and videos. Given the range of usages for this data, we provide the necessary preprocessing operations, like zoom, crop, sampling, and creating thumbnails, as you are accessing this data or storing it into ApertureDB. Given how key metadata is to finding the right subset of your data, we manage application metadata as a knowledge graph, which also helps us capture internal relationships between the metadata and data and enables quite complex visual searches. Given that we were always targeting machine learning applications, we also support bounding boxes for labeling, so that you can do annotation-based searches using this metadata graph. Since feature vectors are representations of the content in images or frames, they make it possible to find visually similar objects. That's why we also offer similarity search using feature vectors. Now, one of the important goals we had was that we did not want our users to have to deal with multiple systems. That's why ApertureDB uses a query engine, or an orchestrator, to redirect user queries to the right component and collect the results to return a coherent response to the user, and it exposes a unified JSON-based native API to all the machine learning pipelines and end users. These pipelines or users can execute queries that add, modify, and search visual data, metadata, annotations, or feature vectors; you can perform on-the-fly visual preprocessing; and you can do machine-learning-oriented tasks like creating a snapshot of the data that you trained a particular model on. These are the kinds of things you can use ApertureDB for.
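As an illustration of what an annotation-based search over that metadata graph might look like, here is a hedged sketch in the JSON API, reusing the connection from the earlier example. The two-command structure (find images, then find their connected bounding boxes via a reference) follows the general style described in the talk, but the label value, constraint keys, and result fields are assumptions for the example, not documented schema.

```python
# Hedged sketch of an annotation-based search: find images and the bounding
# boxes labeled "dog" connected to them. Field names are illustrative.
query = [
    {
        "FindImage": {
            "_ref": 1,                      # reference this result set below
            "blobs": True,
            "results": {"list": ["id"]}
        }
    },
    {
        "FindBoundingBox": {
            "image_ref": 1,                 # boxes connected to those images
            "constraints": {"label": ["==", "dog"]},
            "results": {"list": ["label"]}
        }
    }
]
response, blobs = db.query(query)
```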
Now, in the previous slide we talked about how ApertureDB manages the data and metadata. It can actually store and access this data from cloud buckets or from storage managed by ApertureDB itself. We also provide C++ and Python client packages to access the ApertureDB server, since most of our users prefer those languages when building their applications. And of course, we provide a good set of additional tools to simplify integrating into the user's ecosystem. For example, to ingest data into ApertureDB, we provide fast concurrent loaders that just expect simple comma-separated files with metadata and data files, which can either be local or provided as URLs. We make it really easy to ingest data into ApertureDB. We also have dataset loaders, like the previous example I showed in the notebook, for machine learning frameworks like PyTorch, where our loaders hide the complexity of fetching training or classification data in batches, and users just need to specify a filter for what data they want to train on. Behind the scenes, we take care of when the next batch should be fetched and how it can be used by any ML framework. We also offer a REST API that is used by our web frontend, but it can easily be used by any labeling framework or just your in-house web UI. And then to support enterprise deployments, we also offer authentication and role-based access control, and we support audit logging and monitoring. As you move to a production deployment, you need all these things to work. I wanted to bring this picture up again to show you how a purpose-built system can really simplify your data pipelines and allow you to focus on the machine learning and data understanding that is actually the primary objective of data scientists and ML engineers. There are additional benefits beyond the productivity boost, such as performance. Before we dive into the components and the design decisions we have made and look at performance, I wanted to talk about some of the benefits to the various user types we have come across, just to give an idea of who benefits and why they need something like this. All this work and the design decisions we have made have been driven by who we want to help. To lay it out more concretely: with ApertureDB, data scientists can move faster with model tuning and deployment, doing searches for data in order to build use cases like classification, object detection, activity recognition, or building something to recommend what to watch next or what product to buy. So data scientists and ML engineers clearly benefit from this. There are also other key personas that can really benefit. For example, there are infrastructure teams who now have to maintain one system as opposed to the five they had to maintain before. Then there are data engineers; they now have a simpler time managing the lifecycle of data because everything is taken care of by ApertureDB internally. And because there's a common visual repository that all data scientists can work off of, data science managers can see more team collaboration and faster results. There are also some important design requirements that we take care of in terms of security, privacy, monitoring, reliability, and availability. And for anyone wondering how much companies really care about this:
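To give a sense of what that CSV-driven ingestion can look like, here is a simplified sketch. The product provides fast concurrent loaders; this sequential loop is only an illustration of the same idea using an AddImage-style command, and the CSV layout, column names, and "format" parameter are assumptions for the example.

```python
# Hedged sketch of CSV-based ingestion (sequential, for illustration only).
#
# images.csv might look like:
#   filename,id,license,date_captured
#   /data/img_000001.jpg,1,3,2021-06-01
#   /data/img_000002.jpg,2,1,2021-06-02
from aperturedb import Connector
import csv

db = Connector.Connector("aperturedb.example.com", user="user", password="pw")

with open("images.csv") as f:
    for row in csv.DictReader(f):
        with open(row["filename"], "rb") as img:   # or fetch a URL instead
            blob = img.read()
        query = [{
            "AddImage": {
                # Every remaining CSV column becomes a metadata property
                # on the image node in the knowledge graph.
                "properties": {
                    "id": int(row["id"]),
                    "license": int(row["license"]),
                    "date_captured": row["date_captured"],
                },
                "format": "jpg"
            }
        }]
        db.query(query, [blob])
```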
The visual intelligence team at a major home improvement retailer was really concerned about the loss of productivity because their complex DIY setup would cause days of delays each time they started building some new pipeline or training some new model. So now we are managing millions of images, plus product information and annotations, for them. In fact, they are expanding to start using our embeddings support to do similarity search and build their recommendations pipeline on top of ApertureDB. We have worked on a use case from a camera intelligence team around label management for frames grabbed from retail cameras. There's also a healthcare use case around attaching arbitrary n-dimensional embeddings to document images and then searching using the similarity search functionality. So now that we know the why, the why now, the who, and the how, it's a great time to switch gears into some of the design choices we have made over the course of developing this database and its unique API. We had already established that metadata was important and would be the key to finding the right data subsets in the most common queries. So the natural question was how do we store it, or model it. We looked at quite a few visual applications back at Intel when we were starting, from the very simple COCO dataset example that a lot of machine learning people use for training, going all the way to complex medical imaging applications. In all of them, the metadata and the connections look like graphs if you visualize the prominent entities and relationships. Furthermore, these are essentially property graphs, where nodes and edges, or entities and connections, have key-value properties and can optionally be grouped by classes or tags. Back at Intel, we had built what we called at the time PMGD, or Persistent Memory Graph Database, in order to target the non-volatile memory, or persistent memory, that Intel was going to launch, since that particular memory hardware gave us an opportunity to address the traditional disk latencies that were plaguing contemporary graph databases. And it was a good demonstration of how you could use non-volatile memory. That's actually around the time Andy and I met, because we were working on two different types of databases targeting the same hardware around a similar timeframe. Anyway, since non-volatile memory was not mainstream at the time when we started working, we also introduced durability with a DRAM plus SSD combination. Our graph database is an ACID-compliant graph database, and due to its data model it is very easy to evolve the schema by adding or updating any property within nodes or edges at any time, which is really important for visual applications, because every time your model gets better, you can extract new information and you want to go back and update your original data. We modeled its API with Neo4j as a reference at the time, looking at how you add a node or an edge, set properties, and other such API calls. We also implemented graph traversals, set operations, and tools to manage loading and reading graphs. While our graph library and its API can be used in any application, we have used it within ApertureDB for storing metadata. So the graph API is essentially hidden behind the native API that ApertureDB exposes to the users.
Now, designing a graph database and the data structures that go within it could be a whole talk by itself, on why you chose something and why you didn't choose something else. I will just briefly go over some design points here. Of course, when we were building this, we had to be aware of persistent memory characteristics: it was slower than DRAM, with lower bandwidth, but it would be load/store accessible and much faster than disk. So we had to try to maximize the benefits of processor caches, avoid unnecessary writes, but also get rid of the serializing and deserializing code that traditional graph databases have to run when they persist their data to disk. Some of the more prominent data structures in our system: we allocate fixed-size node and edge objects that align with cache line sizes, on top of memory-mapped sparse files that map large physical pages. Now, you have to remember that all this design started out targeting persistent memory, and some of it has evolved as we have been building and fixing certain things internally. But what I'll talk about today is what we built at the time, and what's actually open source and available if you want to take a look at it on the Intel Labs GitHub. So we have persistent memory data structures, and the pages are mapped on read/write access. We manage restart by using fixed persistent memory addresses in the virtual address space beyond a terabyte so that they don't conflict with other applications. At the time we used the ext4 file system with direct access (DAX) extensions. Since a lot of cloud systems do not yet have persistent memory, we also introduced msync to disk to support the DRAM plus SSD combination, in addition to cache line write-backs and memory fences. Intel at some point removed the need to do explicit write-backs to persistent memory, so we just issue an sfence in transactions if we are actually running on a PM-based system. Now, our allocators manage all the non-node-or-edge space, to allocate the properties and indexes, and we have a combination of allocators: fixed-size for internal data structures and variable-size for properties. There is a third type for large page allocations, in case you have large blobs as properties. The fixed-size allocators are modeled after jemalloc, and the variable-size ones are basically our own internal algorithm, which I can talk about later if someone is interested. Then, a lot of the time user queries just want to find the top 100 or 1000 results. Since it's better to avoid unnecessary accesses, we implemented lazy iteration on finds. Now, you might want to query nodes or edges in a class, like Person, or with a class-plus-property combination, like a Person of some age, and that's why we have two-level indexes on node class and properties and on edge class and properties. There's also an edge index within each node in order to find incoming and outgoing edges very easily based on their classes, which also supports fast neighbor traversals, one of the big benefits of using a graph database. We do undo logging, which is quite similar to the write-ahead logging that Andy has probably already talked about. And then finally, we were aiming for billions of nodes and edges, which meant that if we used per-object locks, we would fill up our DRAM just with lock objects.
So we implemented striped locks that map a larger memory area to a lock and use the address bits to figure out which index in the lock table to use for a certain lock. We use reader-writer locks throughout our implementation. Given the different access patterns of the various data structures, there are different concurrency mechanisms throughout. For example, if you are trying to get access to an allocator, it means you're primarily going to use it for writes, so we just assign an allocator that's owned by the transaction for its duration, whereas there are finer-grained locks within our tree indexes so that you lock only up to the point of the tree that you're going to modify. And like I was saying, there have been some changes over time as we've been fixing things for performance, reliability, and admin tooling, and we can talk about that offline too. Now, I wanted to bring up just this one graph for performance. We ran up against time constraints for publishing the paper we were writing on this graph database, because, well, I left Intel, and when I had time Intel wasn't ready to publish, and after that there was just company stuff happening. So, just to give you a quick idea of why we chose PMGD, the persistent memory graph database, for our metadata storage instead of going to some other graph database: there is a benchmark called the Social Network Benchmark provided by LDBC, the Linked Data Benchmark Council, and you can tune how many nodes and edges it simulates; it's modeled on a social network like Facebook's. The one we used for this evaluation had 99 million nodes and 655 million edges, and there were large properties on each one of those. At the time we were doing this evaluation, Intel had an emulator which would help you emulate persistent memory characteristics. That's what we used, and comparing our query times with Neo4j at the time, we were sometimes 14 times better on very common social network queries using PMGD. And even for throughput, as you can see in the graphs here, especially if you're doing a lot of read-only queries, which is very common in the workloads we encounter, this is over a 3x improvement compared to Neo4j when running with 48 threads on the system. The writes drop the bars a little bit, and that's the part we have improved internally within ApertureDB. So I will leave it at this for the graph database and move to the actual data management, managing images and videos. Yeah, for the social network benchmark, how much does that look like the workload you're running on the visual data in ApertureDB? So that evaluation we did with just the graph database part; like I said, you can use that library separately. There was a Java-based benchmark and we had Java bindings, so we created links from the Java bindings to run with PMGD, and there were already existing bindings someone had done for Neo4j, and we just launched those two separately. So that wasn't in the context of ApertureDB; this evaluation was separate. My question is, for the queries the customers want to run on the metadata that you're collecting now in the product, do those queries look like what you showed in the benchmark?
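To illustrate the lock-striping idea just described, here is a small sketch. It is not PMGD's actual C++ implementation, just the general technique: a fixed table of locks covers the whole address space, and an object's address (with its low bits dropped) selects the stripe, so memory for locks stays bounded regardless of how many nodes and edges exist.

```python
# Illustrative sketch of lock striping (not PMGD's real code): a fixed lock
# table instead of one lock per object; the address bits pick the stripe.
import threading

NUM_STRIPES = 1 << 12          # 4096 locks total, regardless of object count
STRIPE_SHIFT = 6               # objects in the same 64-byte region share a lock

class StripedLocks:
    def __init__(self):
        # Python's stdlib has no reader-writer lock; plain locks stand in here.
        self.stripes = [threading.Lock() for _ in range(NUM_STRIPES)]

    def lock_for(self, address: int) -> threading.Lock:
        # Drop the low bits, then map the remaining address bits into the table.
        return self.stripes[(address >> STRIPE_SHIFT) % NUM_STRIPES]

locks = StripedLocks()
node_address = 0x7F3A00001234            # e.g. the node's offset in the mapped file
with locks.lock_for(node_address):
    pass                                  # read or modify the node under its stripe lock
```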
Yeah, so those queries are in some ways similar, because ultimately what the LDBC queries do is find a person, find all the posts that person made, and count them or something. In the case of a retail workload, for example, it would be: find all the products with a certain label and find all the images. So they simulate that same pattern: find a thing, traverse its neighbors. In that sense they are similar; of course, then you have to go access the image, and that changes the numbers. So, the visual compute module manages the data itself: images, videos, feature vectors. Storage can be on a file system or on an object store like S3 or Google Cloud Storage. We support different image and video formats and encodings, typically using OpenCV or FFmpeg. We have implemented some of our own functions too, but we try to reuse these libraries because they are very optimized for the architecture we run on. We can use the metadata in tandem with the videos to index keyframes, which helps us enable faster sub-video access. This module also exposes the API for preprocessing operations, so this is where we support resize, rotate, sample, and dealing with different encodings, containers, and formats. This is the module that essentially encapsulates all the data operations. Now, an increasingly useful, machine-learning-driven feature is the support for indexing feature vectors, or descriptors, that describe an image, a frame, or a bounding box within them. These descriptors, in the case of visual data, are typically extracted using some later stage of a neural network. We support different indexing techniques that trade off accuracy and speed of search by adding persistence to Facebook's FAISS library indexes, and we support some of our own sparse and dense indexing formats. The visual compute module then exposes functions to find k-nearest neighbors in a dimension-agnostic manner. The way our customers use this is they'll sometimes have 64-dimensional feature vectors, sometimes 128-dimensional; they'll index them differently and try to find which distance metric works better or which method of extracting embeddings works better; they can do all of that. And the feature vectors also have a representation in our metadata graph, so each feature vector that you extract is connected to what it represents, and then you can use that to find the closest matches and then go find the actual images or bounding boxes they belong to. So that can be another way of filtering and finding the data. Those are the salient points on the visual compute module. The next one, and something that ties all these components together, is the query engine, or the orchestrator, or you could call it the ApertureDB server. It's responsible for providing transactional guarantees across not just the metadata that's handled by the graph, but also any visual objects like images, descriptors, or videos that you send as part of a query to ApertureDB. Those guarantees are enforced by the server. It's also responsible for caching data as required, especially if the data lives on some slower media.
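For readers unfamiliar with the accuracy/speed trade-off mentioned above, here is a minimal, in-memory illustration of k-nearest-neighbor search using FAISS directly. ApertureDB layers persistence and metadata links on top of indexes like these; the dimensions, vectors, and number of clusters below are made up for the example.

```python
# Minimal FAISS illustration of the k-NN trade-off described above.
import numpy as np
import faiss

dim = 128                                   # e.g. 128-dimensional embeddings
rng = np.random.default_rng(0)
embeddings = rng.random((100_000, dim), dtype=np.float32)

# Exact (flat) L2 index: slower to search but fully accurate...
exact_index = faiss.IndexFlatL2(dim)
exact_index.add(embeddings)

# ...versus an IVF index that trades a little accuracy for much faster search.
quantizer = faiss.IndexFlatL2(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, 1024)   # 1024 coarse clusters
ivf_index.train(embeddings)
ivf_index.add(embeddings)

query = rng.random((1, dim), dtype=np.float32)
distances, neighbor_ids = ivf_index.search(query, 10)  # 10 nearest neighbors
# neighbor_ids can then be mapped back to images or bounding boxes via metadata.
```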
We also support a batching API, where multiple clients in parallel can launch a search but each only wants part of that data to operate on, which is very common for training workloads. We support that, and the query engine is responsible for caching the next batches and returning them as required. It also implements role-based access control and is responsible for logging monitoring information. That's where we also enforce API validation and types, and this is where we handle errors. There's a lot of logic that goes into this, but a lot of it is pretty well known in the database community. So what I want to talk about here is what makes our query engine different, and that's this notion of the visual-first JSON API that we have defined. We were faced with the choice: do we try to use SQL, or do we try to use a graph query language? What we realized is we wanted to support not just metadata but operations as well, and all of these things pointed us towards using JSON and defining our own API. So user applications, typically Python-based, or C++ if needed, don't really see any of the components we talked about before. They just interact with this engine through the visual-first API, a sample of which is shown here. If you look at this, this is a FindImage command where you put in a metadata constraint, and it actually returns you the image as well. So you can use one command to retrieve not just the image but also the metadata describing it. Another example shows how we make use of this: you can not only apply constraints, but you can apply operations as you're accessing the data, and then the server takes care of finding the right image, applying those operations, and returning the image in the processed format to you. It does the operations on the fly, near the data, and we'll see the performance benefits from that. And if you go to our documentation page, you'll see the JSON API supports add, find, update, and remove for application entities, videos, images, feature vectors, bounding boxes, and any relationships among those. Overall, we've now introduced API sets of varying levels of complexity for our different users. In tandem with the query engine and the client module, we have our native API, the Python and C++ connectors help us build an object mapper API, which is kind of similar to SQLAlchemy, and then we have a REST API for web access. Let's now switch gears to what our design decisions give us in terms of performance. Some of the common queries when, say, creating training datasets involve searching for some keywords or labels in metadata, finding the associated images, and processing or augmenting them before training a model. That's what we've been talking about all this time. All these operations are offered by ApertureDB through its unified API, which is also true for the open-source precursor of ApertureDB that we started with, which we called VDMS, or Visual Data Management System, back when we were at Intel. The numbers I picked to show you today are from our VLDB paper published last year, and at the end I'll share a link. These numbers are from VDMS, and we've actually made the performance even better over time. So anyway, in order to compare at least some of the features to the kind of DIY system that we've heard people say they build today:
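To make the two sample commands just described concrete, here is a hedged sketch of what they might look like in the JSON API: one FindImage that filters on metadata and returns the image together with some of its properties, and one that also pushes preprocessing operations to the server. The property names, operation parameters, and constraint values are illustrative, not the slide's exact query.

```python
# Hedged sketch of the two query styles described on the slide; specific
# property names, constraint values, and operation parameters are made up.

# 1) Metadata constraint only: one command returns the image plus the
#    metadata properties that describe it.
find_with_metadata = [{
    "FindImage": {
        "constraints": {"dataset": ["==", "yfcc100m"]},
        "results": {"list": ["id", "license", "date_captured"]},
        "blobs": True
    }
}]

# 2) Same search, but the server also applies preprocessing near the data
#    and returns the already-transformed image.
find_with_operations = [{
    "FindImage": {
        "constraints": {"dataset": ["==", "yfcc100m"]},
        "operations": [
            {"type": "rotate", "angle": 90},
            {"type": "resize", "width": 224, "height": 224}
        ],
        "blobs": True
    }
}]

response, blobs = db.query(find_with_operations)
```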
We set up our own parallel DIY system where the metadata searches were supported by Postgres, images were served through an Apache web server, and they were processed into the right format using OpenCV, which is a very popular image processing library. We evaluated our representative queries on both systems using the Yahoo Flickr Creative Commons dataset, which has around 100 million images and videos, a pretty commonly occurring scale that represents a wide set of workloads we have encountered in industry. We were up to 35 times faster depending on query complexity, and 15 times faster on average. I'll try to break down some of the benefits and where they come from, but the paper goes into a lot more detail on how we do in terms of database size, how we compare to MySQL, where the benefits are coming from, and some variations on the different queries. Earlier we showed a comparison of our graph database against Neo4j, another graph database. But we also have some examples of how our metadata searches, especially since this metadata is connected in nature, are better supported by our in-memory graph database, where you no longer have to do joins in the queries. The queries I highlight here, the red outlined boxes, do not involve preprocessing, so the benefits are not coming from preprocessing data near where it lives versus fetching it to the client; here it's just filtering by a metadata constraint and returning the images as-is. Compared to Postgres, we see orders of magnitude improvement when using 56 concurrent query threads, and at the bottom you can see how the queries vary as the database size increases, so what happens when we get to 100 million images versus where we started. And then there are different query combinations; I don't have time to get into those, but the paper describes all of these queries and what each of these numbers means. Now, about preprocessing near the data: because our API gives you the option to specify preprocessing operations, you can do that preprocessing near the storage location rather than bringing the data to the client and then applying the preprocessing there. Frequently, given the nature of these workloads, what ends up happening is there is high-resolution data sitting somewhere in storage; you have to fetch it to the client and then reduce it to a very low resolution, because that's all the neural network can accept. What we have seen is that because we can specify those operations and perform them near the data, we can get a sizable reduction in network usage, because the common case is high-res storage feeding low-res neural networks or thumbnail displays. So basically, because of the design decisions and how we've combined and implemented these things, ApertureDB can be very high performance, so that the data pipelines don't become a bottleneck in the whole machine learning setup. There are some other metrics that we have also evaluated over time, using some of our customer benchmarks. One of them is how easy or difficult it is to create and use something like this.
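For contrast, this is what the common DIY pattern described above looks like on the client side: fetch the full-resolution image over the network, then immediately throw most of the pixels away with an OpenCV resize. The URL and sizes are placeholders; the point is only that the reduction happens after the bytes have already crossed the network, which is the cost that the near-data operations avoid.

```python
# The DIY client-side pattern described above (placeholder URL and sizes):
# the full-resolution image crosses the network first, and only then is it
# shrunk to what the model actually needs.
import cv2
import numpy as np
import requests

url = "https://images.example.com/originals/img_000001.jpg"   # e.g. a 4K original
raw = requests.get(url).content                 # full-size bytes over the network
img = cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)

model_input = cv2.resize(img, (224, 224))       # most of the fetched pixels are discarded
```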
One of our customers told us that building a system to store their images, metadata, and feature vectors, even trying to use some off-the-shelf components, would have cost them at least six to nine months for the API that they wanted to expose. Using ApertureDB, they were able to create significantly simpler machine learning pipelines for the overall task and just focus on the machine learning side of the problem they needed to solve. Using this customer's data, we have also tested ApertureDB at a scale of over a billion metadata entities, as many relationships among those entities, and over 300 million images. And we haven't even tested it to its limits, because it keeps scaling with the resources you provision behind it. So that's what I have in terms of performance, and we can discuss it more in depth, but I wanted to talk about where we are going next. Ever since we started developing this technology four years ago at Intel, or five years now, all of our focus was on creating an awesome database core, since users would be entrusting us with their data and we want to ensure the reliability and performance of that data access. Now that the core technology is solid and getting traction, we have been focusing on usability, because we want to make it accessible to people across an organization without them having to learn query API specifics or worry about transactional semantics and things like that. We also want to do this part together with user feedback, as much as we can, and that's why it's really great to be able to work on these things as we work with customers. That's why, over the course of the next year, we'll be enriching our object mapper layer and adding more enhancements to our UI. On the machine learning side, we'll be introducing more integrations: we've worked with PyTorch, we also work with TensorFlow, so we'll keep enhancing that and work with more frameworks. We're also introducing other features like complex regions of interest: right now we support bounding boxes, and we support storage and retrieval of polygons, but there are other, more complex APIs that we can offer around polygon regions of interest. We are working on weighted similarity search. There are also database features like introducing user-defined functions that execute near the data, better caching and query optimizations, and of course cloud scale, keeping everything transparent to the user so that they can keep working with the system while we scale underneath. And then our customers provide some really great feedback in terms of the API they would like to see, and that also is a significant source of enhancements on our roadmap. Our journey and growth are also tied to our transition from researchers to entrepreneurs. A lot of the learnings have been about how to approach, understand, and work with our users to build a product that really simplifies their lives, and it gives us plenty of technical challenges to keep us occupied for a while. So if you have cool ideas you want to develop on the system, or you want to deploy it, please write to us. That's the email, and if you scan that QR code it will give you some other links. I found this really cool from some of the other speakers, so I kind of borrowed the idea. But that's it; I'm ready for questions. Let's open the floor to questions. We have plenty of time.
Yeah, I have a question. Thanks for the talk. I'm curious what it would take to deploy a GPU model, just a basic PyTorch ResNet image classifier, using ApertureDB. What would that look like? If you're talking about a GPU-optimized model, that's something PyTorch would handle, and if you have a machine that is provisioned with a GPU, the data side would still work the same with ApertureDB. If you're asking whether we could optimize some of the functions on the GPU ourselves, no, not yet. So basically, what would get stored? What's the difference between supporting the OpenCV workload and supporting a GPU-based workload? Do you have benchmarks for that? We don't yet have benchmarks for that. We have had plans for exploring how we can optimize some of the operations using other libraries and perhaps using GPUs; we haven't gotten there yet. Okay, so the benchmark that you ran with OpenCV, that is all within ApertureDB, is that right? Yes, we were supporting the preprocessing operations in that case. PyTorch was running outside in that particular example. It was accessing data that was stored in and searched using ApertureDB, and what we used OpenCV for was, let's say, resizing the images to the pixel dimensions that the AlexNet model in that example used. Okay, cool. That makes sense. Thank you. So right now what happens is, let's say you have to use PyTorch functions or something to change your data or augment your data. What we are saying is you can offload some of that functionality to ApertureDB, so that if all you're going to do is fetch a high-res image and then reduce its resolution after making it travel the network, you don't have to do that anymore. You can offload some of those functions to ApertureDB, because we support a lot of the preprocessing and augmentation that's common across different applications. Okay, maybe this is what I meant: you're not actually running the full PyTorch program. No, we don't run it internally. It's not like you can specify "run PyTorch" in our API. You use our API to create the dataset that gets fed into PyTorch and use it there. Got it. And of course, that means you push down some constraints or some filtering so that what gets fed into the Jupyter notebook running PyTorch is a somewhat prepared version of the dataset, even though PyTorch still has to do all the heavy crunching. The machine learning side of the crunching, yes. So let me show you an example, since we have some time. Normally what would happen is, so we have this JSON-based example, right? And we've just used a simple classifier because the point was to show the data side of it. And in here, just the complexity of storing the bounding box, being able to search with it, finding the image it relates to, that's the part we have addressed from the data management perspective. And that's what gets missed, because a lot of the conversation, you're right, is centered around how it's hard to configure machine learning models and hard to monitor what they're doing. That's a whole area and body of work by itself. But then there's all this complexity on the data management side, and you just pay that price because you don't think about it as actively.
And that's where we've realized that you do end up paying this price repeatedly, because even after you've trained the model, you fetch different images from different sources, and you keep paying the data engineering price multiple times. And so here, if you see, we connected to the database and we find, let's say here we want to train with pictures that contain horses. We take that, and we just create the dataset. PyTorch has this notion of datasets defined, and then behind the scenes, fetching things in parallel, batching them, accessing them as the next stage starts, all of that is taken care of by our tool. But the training part or the classifying part, that's still the model, and PyTorch runs that part. A follow-up on that: if I wanted to deploy a new model, how would that work? Say I deploy a new model, run it over all the data in ApertureDB, and then have a duplicate set of results with some differences between the two models. How would I then disambiguate my ApertureDB query over the results of those two models? So, actually, that's one of the ways our customers have been using the metadata part to great advantage, and we are actually thinking of adding a dataset API: you can create an entity, or a node, called Model and connect it to the data that you used. Our edges allow you to put properties on them, so you can indicate any properties of the model that you want to capture. Then you can have another similar entity, and the queries will tell you which data the models ran on or what they were trained on. And if you use the model for classification, those edges are where you could store the accuracy of your classification, and you can find things like which model classified the same data with higher accuracy. Okay, yeah, that makes sense. Thank you. Do your customers want you to run the PyTorch stuff for them, or no? They have their own PyTorch stuff, so they're happy that we gave them the dataset. Got it. Yeah. Okay. This is absolutely not an area that I know a lot about, but do your customers ever ask you to support, or could ApertureDB connect to, an existing DAM? I'm thinking of one digital asset management product in particular that I'm most familiar with. Do you see yourself, I mean, it's not exactly the same thing, but they are storing metadata. Yeah, that was actually one of the design discussions when we deployed this at the customer I talked about earlier. They had their images living in an asset management system, which wouldn't allow them all these functions: data scientists want to be able to go put new embeddings in and come up with new classifications, and it's not easy to modify an asset management system because they're very tied to a certain image storage and very marketing-oriented rather than data-science-oriented. So their question was, can you put a link in the metadata to the images that are stored there and access them from there? But ultimately the choice became, what's going to be faster? Is it easier to just replicate and store it in the data managed by ApertureDB, or should we access it in place? In their case, the performance of access was more important than avoiding replication, so that's the choice they made.
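Based on that description of modeling a trained model as a node and connecting it to the data it ran over, here is a hedged sketch of what that could look like in the JSON API, reusing the connection from the earlier examples. The entity class, property names, and connection class below are illustrative assumptions, not a documented schema.

```python
# Hedged sketch of tracking models in the metadata graph, as described above.
# Entity/connection class names and properties are illustrative assumptions.
query = [
    {   # Create a node representing the newly deployed model.
        "AddEntity": {
            "_ref": 1,
            "class": "Model",
            "properties": {"name": "resnet50_v2", "trained_on": "2022-03-01"}
        }
    },
    {   # Find the image this model just classified.
        "FindImage": {
            "_ref": 2,
            "constraints": {"id": ["==", 123456]}
        }
    },
    {   # Connect model to image; the edge carries the per-model result,
        # so later queries can compare accuracy across models.
        "AddConnection": {
            "class": "classified",
            "src": 1,
            "dst": 2,
            "properties": {"label": "horse", "confidence": 0.93}
        }
    }
]
response, _ = db.query(query)
```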
We do have use cases where we have to link directly to something that's living in cloud buckets, and that's fine, we can support that; that's where the data is. Yes, that's expected. And it goes back to the conversation we had many years ago: it seems like, as you said, the usability part is the most important thing, and what's underneath the covers, the data management itself, is all abstracted away because it's hidden from the developer. You have this API, this orchestrator; they don't actually see what's underneath. So how complete is the underlying graph database right now? Are you spending your time building that thing, or the things up above, like the orchestrator, the API, the usability stuff that the users actually see? Most of the work in the last year has been on the top. Occasionally we have to enhance the graph database because this is a different scale of concurrency than we've seen and we need to handle it well, or we need a different tool to understand what's happening or to migrate something. But a lot of the enhancements have been in that query engine layer, because building the role-based access control and all the logging and monitoring support, all that stuff went above. The metadata part has mostly remained stable over time. We'll have more and more work as we scale out, and that's something I'm looking forward to in the product roadmap for sure. Yeah, and from a startup standpoint that's the fun part: if you make the underlying graph database better, the customer doesn't see that, so I feel like the effort should be up above. Right, and that's why a lot of the work we've done has been on the top, and plus, like we were talking about before the presentation started, there is an underlying expectation that that stuff will just work. What matters is the API, and that's why we built this web front end and are creating this Python object layer, because that's what the customers interact with. But underneath, they just expect that if we put 50 million or 100 million images and 300 million entities in the graph, it'll just work. And that's why, behind the scenes, we have to make sure it keeps scaling and understand what the limits to the scaling are, and that's why we have evaluated to a billion entities and as many connections, so that we can keep scaling before the customers get there. Wouldn't you be better served by just switching to something like Snowflake? Because they'll make their system better and you get all that for free, free in quotes. Yeah. Right now, the reason we chose the graph data model was very much that we did look at what the alternatives were and whether they would fit as easily: would they make sense, or would we have to contort ourselves a lot to make the metadata fit in that structure and make the data work that way? Over time, if it turns out that scaling our own system components is just so much more work than using something else, we'll have to evaluate that.
We do internal performance evaluations and make sure that the components we are using are really the most suitable components for the workloads. All right, any other questions from the audience? I've got one last one. Go for it. Thanks. Okay, so you mentioned how scaling up to 100 million images just works. I'm curious, having worked on vision teams before, oftentimes you'll end up with 20, potentially many more, different developers working on different models that actually do different tasks, with different, maybe non-COCO, metadata formats. What would the data management look like in this graph representation, and what would the queries begin to look like, when you start having to add nodes to manage all these things? So in some ways, if you look at the property graph model, you can introduce any type of application-level entities; you can just keep adding your own classes. That's kind of what happens even now in our deployments: it's not just one team or one data science user, it's usually multiple teams. Some teams are focused on adding new annotations and training with annotation-based queries. Sometimes they're adding new models and evaluating them. Sometimes they introduce new products and introduce those connections. So it can keep growing and it supports all of those. Sometimes they bring in a public dataset, so there are images from COCO and they'll just indicate that with a property name or indicate it on an edge key-value property. There are different ways, and then we work with our customers on, okay, this is the best way, these are the indexes you can build, this is what will simplify your query. Those are the kinds of things we can do, and the data model just supports adding any number of different types of entities and connection types.