Hello, my name is Kyungmin. I'm here today with my colleague Kelly to present our recent work, CLAMS, and how we integrate IIIF into the CLAMS platform. As an outline, we're going to talk a little about CLAMS and MMIF, the data format we use in the CLAMS platform, and about how institutions can use CLAMS for their own work; then Kelly will take it from there and introduce some of the available apps and the integration with IIIF.

So we start with this question: what is CLAMS? With recent technology there are now many large audiovisual digital libraries. By the way, we are at Brandeis University, and this work is a collaboration with GBH, funded by the Mellon Foundation. GBH hosts a very large audiovisual digital library, the American Archive of Public Broadcasting, with thousands of videos, radio shows, and TV shows. They approached us to build a platform that provides easy access to AI and machine learning tools, so they can automatically or semi-automatically extract item-level metadata for their cataloging process. Our goal is to combine AI and machine learning tools with large-scale AV digital libraries to build next-generation smart archives.

Here is a schematic summary of the CLAMS platform; I'll briefly talk about each component. First we have the app directory, which is currently a work in progress. With the app directory, developers can write their AI/ML tools and easily publish a tool or app to the directory, so that users can easily download an app and use it with their own CLAMS instance. A CLAMS instance comes with a portable workflow engine, which we'll talk more about later; it provides a user-friendly interface to access these AI tools, create workflows, and run them.

In the middle we have MMIF, the interchange format that enables interoperation between the different apps and the workflow. I'll talk about that first, because it needs some detailed discussion. MMIF is the MultiMedia Interchange Format, and it is the core component that makes every app and every actor in the platform speak the same language. Because we handle audiovisual data and text data at the same time, MMIF provides four types of annotation anchor mechanisms, based on characters, regions, or time. Those are the basic types, and there is one more compound anchor type, the 3D bounding box, which is essentially a bounding box or region polygon extended over time.

As a file format, MMIF is basically just JSON with some JSON-LD components. I'll show you some examples of MMIF. This is an example of a character-based annotation, a named entity: for each annotation object we have an @type, those @types resolve to a web registry of concepts, and all the properties are encoded in JSON. The same goes for image annotations and for audio or video time-based annotations.
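A minimal sketch of what such a character-anchored, MMIF-style annotation might look like when serialized, assuming illustrative field names, URIs, and app identifiers rather than the exact MMIF specification:

```python
# Minimal sketch of an MMIF-style document with one character-anchored
# named-entity annotation. Field names, URIs, and the app identifier are
# illustrative approximations, not a verbatim copy of the MMIF spec.
import json

mmif_like = {
    "metadata": {"mmif": "http://mmif.clams.ai/1.0"},                 # assumed spec URI
    "documents": [
        {"@type": "http://mmif.clams.ai/vocabulary/TextDocument",
         "properties": {"id": "d1",
                        "location": "file:///data/transcript.txt"}}   # source kept as a file pointer
    ],
    "views": [
        {"id": "v1",
         "metadata": {"app": "http://apps.clams.ai/example-ner/v1",   # hypothetical app
                      "timestamp": "2021-06-22T10:00:00"},            # kept for reproducibility
         "annotations": [
             {"@type": "http://vocab.lappsgrid.org/NamedEntity",      # @type resolves to a web vocabulary
              "properties": {"id": "ne1", "document": "d1",
                             "start": 0, "end": 8,                    # character-based anchor into d1
                             "category": "Person", "text": "Kyungmin"}}
         ]}
    ],
}

print(json.dumps(mmif_like, indent=2))
```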
In general AI and ML practice, one tool does one thing, but we want MMIF to carry bridging annotations so that we can look at a multi-dimensional data source as a whole and provide comprehensive information. That way the archives can offer more interactive presentations, and researchers can use the data to train additional AI tools through multi-sensory learning. We also encode the source data in MMIF as a file pointer, and for reproducibility we keep all the timestamps and version numbers inside the MMIF file, so anyone can run the same workflow and get the same result.

This is a somewhat detailed, but still summary-level, schema for a single CLAMS instance. Each archive has its own local storage, and a CLAMS instance includes a container orchestration system; everything is currently deployed with Docker. Once everything has been processed through the apps, users can convert the data on their own into their own format and publish it to their metadata repository.

Here is one simple example workflow that takes video, audio, and text input and generates timed labels of faces and names. For that we are developing two types of workflow engines: one is web-based and provides a GUI, called Galaxy, and the other is written in Python and runs in Bash or any Linux or Unix terminal, which is more suitable for batch processing and for users who are more comfortable with a terminal UI. This is a screenshot of the Galaxy GUI, used to pipe different CLAMS tools together into a custom workflow, and this is an example of the terminal-based pipeline engine: on the left we have a simple configuration file, and on the right we simply start the Python script and run it in the background.
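A minimal sketch of what such a terminal-based batch run could look like, assuming the apps are reachable as HTTP services; the configuration format, localhost ports, and directory names are hypothetical and not taken from the talk:

```python
# Sketch of a terminal-style batch runner in the spirit of the pipeline
# engine described above: read a small configuration listing app endpoints,
# then push every MMIF file through the apps in order.
import pathlib
import requests

CONFIG = {
    "apps": [
        "http://localhost:5001",   # e.g. a slate-detection app container (hypothetical)
        "http://localhost:5002",   # e.g. a Tesseract OCR app container (hypothetical)
    ],
    "input_dir": "mmif_in",
    "output_dir": "mmif_out",
}

def run_pipeline(mmif_text: str) -> str:
    """POST the MMIF document to each app in turn; each app returns MMIF with a new view added."""
    for app_url in CONFIG["apps"]:
        resp = requests.post(app_url, data=mmif_text,
                             headers={"Content-Type": "application/json"})
        resp.raise_for_status()
        mmif_text = resp.text
    return mmif_text

if __name__ == "__main__":
    out_dir = pathlib.Path(CONFIG["output_dir"])
    out_dir.mkdir(exist_ok=True)
    for path in sorted(pathlib.Path(CONFIG["input_dir"]).glob("*.mmif")):
        (out_dir / path.name).write_text(run_pipeline(path.read_text()))
```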
Okay, I'll leave the stage to Kelly.

Thank you, Kay. Now I'm going to talk a little about the different CLAMS producers and consumers. CLAMS producers are tools that produce some sort of annotation: they take MMIF as input, add something to it, and generate MMIF. Consumers are tools that take MMIF as input and do something with it; that could mean converting it to PBCore, or generating a IIIF manifest and displaying it within the Universal Viewer, which is one example we're going to show you.

Some producers we have currently been working with: for video and images, character recognition (OCR with Tesseract), slate detection and recognition (I'll go a little more into that one), and shot and scene detection; for audio, speech detection using Kaldi, and speech segmentation, which annotates audio streams with time frames showing when there is speech versus non-speech versus applause; and for text, producers including named entity recognition and named entity linking, that is, grounding to an authority file.

The first app I'm going to talk about is slate text extraction. Here are some examples of slates. Actually, I'm going to start with slate detection, so ignore the text here in the header. These are frames from videos that carry textual information we could extract and put into some metadata storage. The motivation is probably clear: we want to find those frames, extract the content from them, and use it to populate an archive management system. We developed this tool by working with archivists from GBH, who manually annotated some videos with the start and end times of each slate; we then extracted those frames and trained a model to differentiate between slate and non-slate frames.

The next tool is slate text extraction, another model we trained for detecting the actual location of text within a frame. For the frames we showed with text in them, we annotated them by drawing boxes and transcribing the text. Here is an example of the VIA annotation environment we used, and here is some information about the model we trained. Finally, Tesseract is the last tool in the pipeline we're going to talk about. Tesseract is an OCR engine: you input an image, and you get out information about where in the image there is text, along with a transcription of that text. I'll skip this slide.

Now I'm going to talk a little about the process of converting MMIF to IIIF, and I'll do that by showing some examples. For time frames, such as slates or bars-and-tone (another tool I didn't go into detail about), MMIF has an annotations field, which is a list; here we have a TimeFrame annotation with start and end frame numbers, and in IIIF we convert that to a Range. For each of the documents we have a separate canvas, and for each of the annotations we generate annotations that link to the corresponding canvas. For bounding boxes, MMIF has BoundingBox annotations that include the coordinates; we also have TextDocument annotations holding the transcription of the text, which is the output from running Tesseract; and an Alignment annotation links the box location to the box contents. In converting this to IIIF we generate an annotation list, external to the core manifest, and within that annotation list we generate, for each bounding box, an annotation with the characters displayed.
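A rough sketch of the two conversions just described, assuming common IIIF Presentation 2.1 conventions; the base URI, frame-rate handling, and exact field choices here are illustrative and not taken from the actual CLAMS converter:

```python
# Rough sketch: an MMIF-style TimeFrame becomes an IIIF Range, and a
# BoundingBox plus its aligned TextDocument becomes an entry in an external
# annotation list carrying the recognized characters. URIs and frame rate
# are hypothetical; field choices follow common IIIF Presentation 2.1 usage.
BASE = "https://example.org/iiif/video1"   # hypothetical manifest base URI
FPS = 29.97                                # assumed frame rate for frame-to-second conversion

def timeframe_to_range(frame_type: str, start_frame: int, end_frame: int) -> dict:
    """Convert a TimeFrame (frame numbers) into a Range pointing at a temporal canvas fragment."""
    start, end = start_frame / FPS, end_frame / FPS
    return {
        "@id": f"{BASE}/range/{frame_type}",
        "@type": "sc:Range",
        "label": frame_type,                                     # e.g. "slate" or "bars-and-tone"
        "canvases": [f"{BASE}/canvas/1#t={start:.2f},{end:.2f}"],
    }

def boundingbox_to_annotation(box: list, text: str) -> dict:
    """Convert a BoundingBox plus its OCR text into an annotation for an external annotation list."""
    x, y, w, h = box
    return {
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": text},  # the transcribed characters
        "on": f"{BASE}/canvas/1#xywh={x},{y},{w},{h}",
    }

print(timeframe_to_range("slate", 0, 150))
print(boundingbox_to_annotation([40, 60, 320, 48], "PRIME"))
```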
Using the IIIF manifests we generated from MMIF files, we have been experimenting with displaying the results in the Universal Viewer. Here is one screenshot example of that: a slate frame that has been converted to a manifest and includes bounding box annotations. There is a search service included in the Universal Viewer, and we searched for the word "prime"; it found the word "prime" in the first and second frames. It should be highlighted, but we're having an issue getting that to work. And here is another example in the Universal Viewer, a video with time frame annotations that were converted to ranges; the ranges are displayed here on the timeline, and if this were a live demo we could click on the slate range and it would jump forward to where the slate appears in the video. I think that's it. Thank you for listening, and we're happy to take any questions.

There is one question already on Hoover: is MMIF specific to CLAMS? We developed it as part of CLAMS, but the format itself is very general; it's just JSON with some linked data (JSON-LD) components, so it doesn't have to be bound to any CLAMS-specific tools or platform, and anyone can use it for whatever use case fits their needs.

Are there any other questions from folks? We'll give you a moment to type. Okay, one other question: were you inspired by existing workflow engines? Yes. At the beginning of the project we looked at existing workflow engines, including MICO and other prior projects that worked toward similar goals for audiovisual media, and we ended up using Galaxy, a workflow engine developed in the bioinformatics community, as our base, because it is actually quite flexible and easy to adapt for our purposes, and it also provides a really user-friendly interface for non-tech-savvy users. We also have previous work at Brandeis on a similar workflow pipelining project called the LAPPS Grid, which does similar work but only for pure text documents, so CLAMS inherited a lot from what we built for LAPPS.

We're just about at time, but if there is one other question maybe we could answer it really quickly: do you have any thoughts on how this might map to 3D? That's a good question. We started the whole project with GBH's collection, which is entirely 2D video, so we haven't had a chance to think about how to apply the same annotation-anchor concepts to 3D objects, or to three-dimensional computational objects such as meshes. To be honest, we don't have a good answer to that question at the moment, but it will be a very interesting problem to tackle in the future.

Great, thank you all so much, this was a wonderful presentation. Thank you to Kelly and Kyungmin, and thank you to everyone who joined us for the call today. Just a reminder that we'll be making a recording available in the weeks after the conference, so keep an eye out for that. Again, thanks all, and we will see you at the next session. Thank you. Thank you. Bye, everyone. Bye.