In building the platform, we thought it would be a great idea to give back to the community by opening it up to everyone: to people who want to share their open data and are looking for a place to host it for free, and to everyone who wants access to it. We're still working really hard on that part and on building a community around open data. Along the way we encountered a bunch of different hurdles, including issues with open data licensing, open data standards, and other problems. That's why we created Project OpenBytes in 2021. We're aiming to use this form of organization to bring together all the people who care about the development of AI and about open datasets, to try to solve the problems blocking open datasets from becoming more available and more accessible. We also encourage organizations of any size to share their data, so researchers and industry practitioners can use that data in their innovative work. You can always reach me at my work email at graviti.com, and you can also follow me on LinkedIn and Twitter.

My talk today has a relatively short agenda. In the first section, I'm going to share the challenges we ran into while working on the open datasets platform. After that, I'll share our thinking behind how to solve those challenges; our proposed solution is a unified representation the entire community can use to work with all different types of datasets. In the third section, we'll talk about Portex, a schema definition language we developed in-house and have already open-sourced to the community to try to solve those challenges. In the next section we'll share the roadmap we have for Portex. And at the end, we'll talk about how you, as a member of the community, can participate in making Portex better and making open datasets more accessible and available to everyone.

So, the first section: the challenges we observed when working with open datasets. Let me give a quick introduction to the current state of open datasets. When we were building this open dataset platform, we ran into three major issues. Number one, we found there are often inconsistencies in the data formats across different open datasets. Number two, there is also inconsistency in how a dataset's format is represented and communicated.
And the last one, number three: the formats are also inconsistent from the perspective of the data consumption code we write, for example for data visualization, model training, running statistics on the data, and real-time inference and evaluation.

In the bottom-left corner we show a graph of exactly how it works today. COCO, VOC, KITTI, and Cityscapes are all famous datasets in computer vision. The problem is they all have totally different formats and totally different definitions of how their data is organized. So when we work on visualization for those datasets, or on the model training code or the evaluation code, we have to customize everything for each dataset's specific definition. That creates a huge hassle for developers trying to understand what's inside the data and the relationships between datasets. It's also really hard to build scalable software that works for all the data, and hard to share the code that processes the data, because you either have to convert the data or retrofit the model to the specific format defined by that dataset. Neither case is ideal.

Let me give you a more in-depth overview of the three issues. The first issue is the inconsistency between data formats. Let's use VOC, COCO, and KITTI as examples. First of all, they all save their data, especially the annotations, in different formats: VOC uses XML, COCO uses JSON, and KITTI uses a plain text file. For the first two, you can look at the data itself and, from the names of the fields, roughly guess what's going on inside. But for KITTI, without reading the documentation you basically don't know what each field means. And the funny thing is, even the same type of object can have different definitions. In computer vision object detection we always use a bounding box to represent where an object is, yet the same bounding box can be defined in different ways. VOC uses the top-left corner plus the bottom-right corner; that's why it has xmin, ymin, xmax, and ymax. Some other definitions use the center of the bounding box plus its width and height, and still others use the top-left corner plus the width and height. And sometimes the width and height are on different scales: sometimes absolute pixel values, sometimes a relative ratio from zero to one describing how big the box is relative to the image itself. It's really hard to keep track of what's going on inside each definition; you constantly have to go back to the documentation page to figure out what's actually inside the data, just so you can write code that consumes it.
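To make the bounding-box mismatch concrete, here is a minimal Python sketch of the three conventions just mentioned and conversions between them. The helper names are mine, for illustration only, not from any dataset's toolkit; which dataset uses which convention, and at which scale, is exactly what you have to dig out of each dataset's documentation.

```python
# Three common bounding-box conventions and conversions between them.
# Illustrative helper names; not part of any dataset's official tooling.

def corners_to_center(xmin, ymin, xmax, ymax):
    """Corner pair (VOC-style) -> center point plus width/height."""
    w, h = xmax - xmin, ymax - ymin
    return xmin + w / 2, ymin + h / 2, w, h

def topleft_wh_to_corners(x, y, w, h):
    """Top-left corner plus width/height -> corner pair."""
    return x, y, x + w, y + h

def absolute_to_relative(xmin, ymin, xmax, ymax, img_w, img_h):
    """Absolute pixel coordinates -> ratios in [0, 1]."""
    return xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h

# The same physical box, three different on-disk representations:
print(corners_to_center(587, 173, 614, 200))    # (600.5, 186.5, 27, 27)
print(topleft_wh_to_corners(587, 173, 27, 27))  # (587, 173, 614, 200)
print(absolute_to_relative(587, 173, 614, 200, 1242, 375))
```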
The second issue is the inconsistent representation of the format itself. Beyond the different definitions for the same object, the bounding box, there are also different ways to communicate what a dataset contains and what it means. On the left-hand side we show how COCO gives that information to developers: it has a data format page with a very detailed explanation of the JSON it uses, what each field means, and what information it contains. On the right-hand side we show the documentation for KITTI, which is an entirely separate system: you have to go to the documentation and read it to understand what each number in their text file actually means. The problem with this approach is that inconsistency between the documentation and the data itself creeps in easily. It happens all the time. For example, sometimes people export the data into a text file and forget to update the documentation. We see that inconsistency all the time: we process the data, it doesn't look right, and we have to go back and email the contributors of the dataset and ask, hey, what's going on inside this data? It doesn't seem right. It's hard for them to keep track of it too. Sometimes they come back and say, yes, it's indeed wrong; you have to read the data this way, it just doesn't show up in the documentation. That creates huge problems.

The third problem comes when we consume the data. We don't always design models from scratch; many models are designed by really talented researchers in academia or industry, and we want to reuse their models in our own applications, so we go to GitHub to find them. But those models all require specific formats for training: for example, YOLOv5 requires the COCO format, Swin Transformer V2 requires the ImageNet format, and the list goes on. When your data is organized differently, it's hard to reuse a model trained against some other specific format. To work around this, people in industry and academia write a lot of boilerplate conversion code. BDD100K, for example, provides a toolkit to convert its data to the COCO and Scalabel formats, and if you search GitHub you'll find tons of repos just for converting the VOC format to COCO and vice versa. If we keep creating new formats, this list will go on forever. That definitely creates a lot of problems when sharing the data and sharing the code that processes it.

So how should we solve these challenges? Our thinking is that we need a unified representation for all datasets; with that, we can clear these hurdles. What is a unified representation? I think it has two parts. The first part is a unified data structure that is shareable, reusable, and much more readable. For example, it shouldn't be a text file full of bare numbers where, when you open the file, you have no idea what's actually going on inside.
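Here is a small Python sketch of that readability gap. The first value is a KITTI-style label line with illustrative numbers: without the documentation you cannot tell which number is which. The second is the same object in a self-describing structure, where the field names travel with the data.

```python
# A KITTI-style label line (illustrative values). The meaning of each
# column (type, truncation, occlusion, alpha, 2D box, 3D dimensions,
# location, rotation) lives only in external documentation.
kitti_line = (
    "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 "
    "1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
)

# The same annotation as a self-describing record: a reader, human or
# program, needs no separate documentation to interpret it.
annotation = {
    "category": "Car",
    "box2d": {"xmin": 587.01, "ymin": 173.33, "xmax": 614.12, "ymax": 200.12},
}
print(annotation["box2d"]["xmin"])
```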
The second part is that we need a unified in-memory layout, so that when we write code, we can guarantee the code will work correctly, and we can actually share that code between different people and different organizations.

So what do we gain from a unified representation? We believe it's reusability and shareability, because right now training a model is super expensive. When we're talking about GPT and other big models, it consumes a lot of power and a lot of time to train. Maximizing reusability and shareability goes hand in hand with the spirit of open source. Using the same graph from the previous section: once we have a unified representation, data consumption becomes much easier. We can have a unified visualization that works not only for COCO but also for VOC, KITTI, and Cityscapes; at least the object detection part should be the same. Then we can share the processing code, the model training code, and the inference code.

This is great, this is nice, right? But why don't we use existing solutions? What makes the existing solutions fall short in these scenarios? In this section I want to give you an example that shows why existing solutions don't work, whether it's JSON or some kind of interface definition language (IDL). We use the VOC dataset and the dog-versus-cat dataset as the examples, and we use protobuf as the IDL to define the schema for the data, because it's strongly typed, which is slightly better than XML and JSON. On the left-hand side is the definition of the VOC dataset in protobuf. To be honest, the VOC dataset and the dog-versus-cat dataset are basically the same task: both are computer vision tasks, both are object detection tasks. The only difference between the two is that VOC has a lot of categories, a lot of classes of objects, while dog-versus-cat has only two categories, cat and dog. Other than that, they are pretty much the same, and the data is organized the same way.

But reusing the definition is hard. Say the VOC GitHub repo has this perfect protobuf definition. If you want to use it in the dog-versus-cat scenario, you have to fork that repo, copy the protobuf definition files into your local working directory, open the file, modify the necessary fields to match your requirements, delete the unnecessary ones, save the result into some other protobuf definition file, and put the whole repo back on GitHub. After that there is basically no link between the two. The only way to discover that the two datasets share a similar structure is for a human to open both protobuf definitions, compare the code inside those files, and figure out that the data is organized the same way and that the dog-and-cat dataset probably borrowed its definition from VOC. So you can't really reuse the code written for the VOC dataset directly on the dog-versus-cat dataset.
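Here is the copy-paste problem as a minimal sketch, written in plain Python rather than protobuf for brevity; the effect is the same for any of these formats, and the class names are made up.

```python
# In the VOC repo: the original schema.
class VocObject:
    category: str  # one of VOC's 20 classes
    pose: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int

# In the dog-versus-cat repo: hand-copied from VOC, then edited.
class DogCatObject:
    category: str  # "cat" or "dog"
    xmin: int
    ymin: int
    xmax: int
    ymax: int

# Nothing machine-readable records that DogCatObject was derived from
# VocObject, so no tool can tell the two datasets share a structure.
# Only a human diffing the two files can.
```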
And that's the problem we're having in the industry right now. If we use any of the current solutions to store this type of data, no one will know where a definition came from, whether it derives from an existing dataset, or whether existing code and existing models can be reused. That is extremely inefficient: developers and researchers have to write the same code again and again and again, which doesn't create much value and isn't a good use of their time.

So here is a quick list of what we think the issues are with the current solutions. We definitely need a unified representation, but the current solutions have a lot of limitations on that front. The first limitation is a lack of reusability and shareability. For example, the VOC definition cannot be shared with other people: they can't import that definition, or part of it, into their own projects and leverage the code written against it. The second issue is that there's no support for parameterization when modifying types. What do I mean by parameterization? In VOC we have many different categories; for dog-versus-cat we only have two. Is it possible to share the backbone of the dataset definition and modify only the part that's different? It's impossible with JSON, impossible with protobuf, and impossible with XML, at least not in a manner where we define some base types and each application simply instantiates those base types with its application-specific data structures. The third issue is that there's no hard type guarantee in the formats we commonly use to store structured data and open datasets. With JSON, XML, and text files, you can only put simple strings or numbers inside, and none of those technologies is strongly typed. So you have no guarantees beyond asking the users to be really careful when writing the data, or writing a lot of code to catch corner cases; and when you consume that data at scale, you will run into so many corner cases that you have to keep interrupting the workflow and write code to handle them one by one. And the last issue is that it's really hard to store relatively complex data. Sometimes an annotation isn't as simple as a bounding box: you have many different objects, tables, and properties that are all related to each other, but whose dimensions don't line up, so you end up storing the data in separate tables. Then, when you query the data, you have to do a lot of join operations to assemble it back together before you can consume it. nuScenes is one example of this scenario: it uses relational tables to define the entire dataset, and every time you want a slice of the dataset, you basically have to run a lot of join queries, which is not convenient for a lot of use cases.
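Here is a tiny Python sketch of that last pain point, with made-up table contents in the spirit of a relational layout like the one just described: reading a single sample back means manually joining tables.

```python
# Annotations split across relational tables: one table of images,
# one table of boxes keyed by image_id (illustrative data).
images = [
    {"image_id": 1, "filename": "000001.jpg"},
    {"image_id": 2, "filename": "000002.jpg"},
]
boxes = [
    {"image_id": 1, "category": "car",
     "xmin": 10, "ymin": 20, "xmax": 50, "ymax": 60},
    {"image_id": 1, "category": "pedestrian",
     "xmin": 70, "ymin": 30, "xmax": 90, "ymax": 80},
]

def load_sample(image_id):
    """Reassemble one sample: in effect, a hand-written join."""
    image = next(img for img in images if img["image_id"] == image_id)
    image_boxes = [b for b in boxes if b["image_id"] == image_id]
    return image, image_boxes

print(load_sample(1))
```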
So what's our solution, the one we propose under Project OpenBytes? It is Portex. We are designing a new schema definition language for datasets. If you're curious about the syntax of the language itself, please go to our documentation site and learn more about it. In my opinion it's pretty elegant, and you can use the definitions in many different places.

So what is Portex? Let me give you a quick introduction to the Portex schema definition language. Portex is a unified schema definition language, and "unified" has two aspects. First, it is designed not only for the data annotations but also for the raw data and the metadata associated with it: the images, the audio, the tags, the camera parameters, the LiDAR calibration parameters, the intrinsics and extrinsics, NLP vocabularies. You can put any type of data, from the raw data to its annotations to its metadata, in the same format, all managed the same way, and the entire community can reuse any part of it. Second, as a schema definition language, we envision the data being saved in a columnar way: imagine the data saved into one huge table, where Portex defines what each column is and what type of data can be saved in it, and each column can itself expand into another table, into multiple other columns. That's pretty much the goal: a system where we can store very complex data all together, using Portex to describe the entire dataset, not only the annotations. And why is Portex important and helpful? Because it enables the reusability and shareability of dataset definitions, which in turn enables shared data processing and model training code. That will be really helpful for the entire community to collaborate on building models, sharing models, and sharing knowledge; collectively, we can be better.

So here are some of the key features of Portex. First of all, all types in Portex are composable, and we have a type import mechanism that works with GitHub repos. In your GitHub repo, you can create a folder, or dedicate the entire repo, as a place to store definitions of existing Portex types, and in your other work you can always import a type from any existing GitHub repo. I can show you some of the syntax; it's simple and elegant: you can import any type within a GitHub repo, really similar to Python packages. The second feature is templating: we can define the backbone of a data type once and use the template to instantiate the data structure for different applications, each with its specific definition of the fields. I'll give details in the next few slides. And the third feature is support for multi-dimensional tables, which basically means we can put another table into each cell of the tabular data model. That lets us keep all the complex data together, instead of scattering it across multiple tables and doing a lot of joins the relational-database way at query time. First of all, that's not efficient; second, it easily creates data inconsistency between those tables.
So let me give you a quick example of how import works. On the left-hand side, you can see that to define a bounding-box data structure, you create a file which we call VOCBox2D; all Portex schema definitions are written as YAML. You can see we give it the type record. A record in Portex basically means it's an object: really similar to a struct in C++, really similar to an object in Python. Under that object you can have a number of fields; for a bounding box, the really important fields are xmin, ymin, xmax, and ymax. Each field is assigned a type, and all the data is checked against that type. In this particular example the types are int32, which means xmin and ymin hold actual pixel values, not ratio values. Then let's say your next project wants to import this definition, which is sitting in some GitHub repo. You just write a simple import, much like Python code: import this type from this repo. You can import as many types as you want from a specific repo, and you can import types from multiple different repos. Then in your main body you can use that type: here we have a field called objects, which is an array, and each item of that array is a 2D bounding box. Each object and each field can be viewed as a column: the object itself can be a column, and you can expand it so that each of its fields becomes a column too. Think about a dataset that has images and bounding boxes: you can actually put them together in different columns.

The second feature is templates, which are really similar to C++ templates. In this section I'll bring back the VOC versus dog-versus-cat example. On the left-hand side, we define a backbone for VOCBox2D that takes two parameters. It's a record object with two parameters: the categories, which differ between applications, and the attributes, which also differ between applications. In a specific application you import that backbone (I've omitted some code, because I can't fit much code on a slide). If you want to define a VOC-style data structure, you just say the categories should be aeroplane, bicycle, boat, and so on, whatever categories VOC defines, and for the attributes you can define something like pose: whether the object faces left or right, plus the other fields the VOC dataset really cares about. And for the dog-and-cat dataset, you can instantiate the same backbone with only two categories, one cat and one dog, and then specify the attributes; maybe the only important attribute for this dataset is whether the object is occluded, a boolean member variable.
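Since the slides aren't reproduced here, this is a hedged plain-Python mock of the two features just described, composable record types and parameterized templates. It mirrors the semantics from the talk but is not Portex syntax; real Portex definitions are YAML, so see the documentation for the actual language.

```python
# 1) A reusable record type: named, typed fields, like a C struct.
#    (In Portex this would live in a YAML file in a GitHub repo and be
#    imported by name, much like a Python package.)
VOC_BOX_2D = {
    "type": "record",
    "fields": [
        {"name": "xmin", "type": "int32"},  # absolute pixel values,
        {"name": "ymin", "type": "int32"},  # not 0-1 ratios
        {"name": "xmax", "type": "int32"},
        {"name": "ymax", "type": "int32"},
    ],
}

# 2) A template: the backbone is fixed; categories and attributes are
#    parameters filled in per application.
def box2d_template(categories, attributes):
    return {
        "type": "record",
        "fields": VOC_BOX_2D["fields"] + [
            {"name": "category", "type": "enum", "values": categories},
            {"name": "attributes", "type": "record", "fields": attributes},
        ],
    }

# VOC instantiation: many classes, a "pose" attribute.
voc_box = box2d_template(
    categories=["aeroplane", "bicycle", "boat"],  # ...the 20 VOC classes
    attributes=[{"name": "pose", "type": "string"}],
)

# Dog-versus-cat instantiation: same backbone, two classes.
dog_cat_box = box2d_template(
    categories=["cat", "dog"],
    attributes=[{"name": "occluded", "type": "boolean"}],
)
print(dog_cat_box["fields"][-2]["values"])  # ['cat', 'dog']
```

Unlike the hand-copied protobuf definitions earlier, both instantiations share the same backbone, so the link between the two datasets stays explicit and machine-readable.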
For multi-dimensional tables, the definition is relatively easy: since all the types are composable, you can compose very complex types together. For example, the image type can hold the raw data itself, the binary image, alongside its metadata: the camera information, the timestamp when it was captured, and other things like that. And for the 2D boxes, each image contains a varying number of box objects; it could be 1, it could be 2, it could be 10 or 100. So you put that array, another table, inside the cell, and organizing it that way is relatively easy with the Portex schema definition. On the left-hand side we show what it looks like when data is actually filled into the schema: a filename column with a bunch of strings, an images column with the binary data, and a bounding-box column that is itself another table. This way we don't need to save the data into separate tables; we can keep every single piece of data together and ensure the consistency of the data.

So Portex is great, right? What happens next for Portex? Right now the main syntax of Portex is still a work in progress, and we have almost reached the first release. The documentation site is up and running: feel free to check out the documentation, learn more about Portex, and give us feedback on whether you like it. The official public release will be in July; please stay tuned, and we'll make some noise when we officially launch it. In the meantime, we are also working on in-memory representations, because the schema definition language only defines how the data is organized and what types of data are inside the dataset. When we actually work with a dataset, we need the numbers in memory, in a representation that is easy for different code and scripts to access and process. With this effort, we are converting datasets to a very popular in-memory format, Apache Arrow, an in-memory columnar data format that is heavily optimized for analytic jobs. And if you don't want to use the raw Apache Arrow format, we also provide higher-level in-memory APIs, like a DataFrame. We follow the pandas DataFrame API and re-implemented the DataFrame, so you can use the API and the features you're already familiar with to work with data defined by a Portex schema. And in the future, we are thinking about writing a compiler to automatically generate schema-specific code that users can plug into different applications.
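To show how the nested layout and the columnar in-memory story fit together, here is a sketch using the generic pyarrow and pandas libraries, not the Graviti SDK, whose actual API may differ. One row per image; the objects column holds a nested table of boxes, so no joins are needed. The data values are made up.

```python
import pyarrow as pa  # generic Apache Arrow bindings (pandas also required)

# Column type for the nested "objects" table: a list of box records.
boxes_type = pa.list_(pa.struct([
    ("category", pa.string()),
    ("xmin", pa.int32()), ("ymin", pa.int32()),
    ("xmax", pa.int32()), ("ymax", pa.int32()),
]))

# One row per image; each "objects" cell carries its own table of boxes.
table = pa.table({
    "filename": ["000001.jpg", "000002.jpg"],
    "objects": pa.array([
        [{"category": "dog", "xmin": 48, "ymin": 240, "xmax": 195, "ymax": 371}],
        [{"category": "cat", "xmin": 8, "ymin": 12, "xmax": 352, "ymax": 498},
         {"category": "dog", "xmin": 10, "ymin": 15, "xmax": 120, "ymax": 200}],
    ], type=boxes_type),
})

# Familiar pandas-style access over the same columnar data, no joins.
df = table.to_pandas()
print(df.loc[1, "objects"][0]["category"])  # -> "cat"
```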
For example, in Python, the generated code could read the data and feed it into a training process; and when people build visualizations, we could generate JavaScript code that reads the data and drives really nice visualization plugins that can be shared between different parties. For example, all bounding boxes look the same, so we can write the bounding-box visualization once, and every time the tooling sees that specific type, it will understand how to process the data and visualize it for the users. We're trying to make the compiler compile the Portex schema into many, many different languages, so that people building applications in different languages are all able to consume the data.

Then, in the last section, section five, I want to touch on how we want to work with the community, and how community members can participate in this great journey. There are multiple ways people can contribute. The first is to participate in the Portex language design. We are still designing the language; we have some really great features and we want to add more, so we can adopt more and more open datasets. Feel free to go to our GitHub repos. If you want to implement some features yourself, feel free to send a pull request. If you want to see certain features happen in Portex, feel free to open a ticket in the issues section and let us know which features you need the most; we'll arrange to put them into the Portex roadmap. You can also contribute on the Portex schema side: when you're working on a new open dataset, feel free to include a folder that uses Portex to define the format. You'll find it super useful for leveraging existing technology to process the data, and it will also help other users who want to import your definitions into their own work, extend them, and reuse parts of your definition and of the toolchain you've already built. And the last way: we want more models to use the Portex schema, and more data-processing code to benefit from it. We already have a Portex schema common repo where we have defined some of the common data types. Feel free to share code that uses those Portex schemas, feel free to share models that use them, and you can help us port some classic models to the Portex schema, so the model itself can be reused in many situations as long as other datasets use the same Portex schema.

So please do check out the contributing page of our GitHub repo, and feel free to reach out to us, or to me directly: follow me on LinkedIn, follow OpenBytes on LinkedIn, follow Graviti on LinkedIn, and follow us on Twitter. We'll be happy to talk with anyone who wants to join the journey and participate in this effort. And that pretty much concludes my presentation today. We should all have fun at the OSS Summit, and hopefully we'll see each other in the near future. Thanks. This is Edward speaking.