I'm going to give a demo of how I process the videos for CodeRefinery. This is the Python for Scientific Computing course, actually. Here I am on the streaming machine. I will change to the right place and copy recording 09, so that's today. I copy it to raw/day3-obs.mkv. Then if I do git status, it shows a new file, so I git annex add the raw file. It takes a minute while it's checksumming it, and now it's committed. I can do git annex sync --content. This automatically makes the commit, so I don't need to do anything more. From the other systems it's connected to, which includes our cluster Triton, it knows it should automatically copy the day 3 OBS file to Triton, and now it's distributing this information around. Now I can do git status: it's clean. If I do a listing of the raw data, we see it's a symlink, and that's how I know it's in good shape. So I will log out and SSH to the cluster, to the video processing directory on scratch. Here I will also do a git annex sync --content. Of course, I have to module load git-annex first. Git annex sync --content: it says OK. And now the raw video is on the cluster. From here, I have a Makefile which can make the new subtitles. I think I need to module load whisper first, and then I can do make srt. It has started processing. Yes, I see it's doing the day 3 video. So now I will pause the recording and come back when it's done. OK, so now it's done transcribing with Whisper. It was actually pretty fast; the time just seems long since I went and did other stuff in the meantime. If we do git status, we see there's this new thing, and I can git annex add the SRT file. Actually, first I will do something: I will edit it and remove some of the names from it to make it more private. Notice it says it's not a large file, so it's adding the content to the git repository itself. That's because of the .gitattributes file, which defines what counts as a large file.
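The .gitattributes mechanism mentioned here might look something like the following. This is a sketch, not the repository's actual file; the size threshold and patterns are illustrative, using git-annex's `annex.largefiles` attribute syntax:

```
* annex.largefiles=(largerthan=100kb)
*.srt annex.largefiles=nothing
*.yaml annex.largefiles=nothing
```

With something like this, big video files go into the annex, while subtitle and configuration files are committed to git itself, which is why the SRT file above was "not a large file".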
So "large files", despite the name, doesn't just mean large: it means the file is going to be added to the annex instead of to git itself. With that being said, we can do a normal git commit: "Add subtitles day 3", and then git push; git annex sync --content would work just as well. Now let's come to my own computer. First, though, let's look and see what we see on GitHub. This is the repository: coderefinery/video-processing. If we look at the code, we go to this course, we go to the raw videos, and we see the day 3 raw video there. If we click on it, we see it is just a symlink that's broken. Git-annex distributes the data separately, so just because this is on GitHub, the actual raw video is not available to anyone unless you have that extra access. And the subtitle file is, well, what you expect from a subtitle file. So now here I am on my own laptop. I will use a program called Subtitle Editor... but the file is not on my computer yet. So what do I need to do? Let's do a git pull; I could also do git annex sync --content. Subtitle Editor has started and, oh, what do you know, there's no preview. I want that, so let's close this for now, and do git annex get python-for-scicomp-2023/raw/day3-obs.mkv. And we wait a second. This being my laptop, it's linked to our cluster as a way of distributing the files, so it goes there directly and downloads it. I guess I could have shown how this file was a broken symbolic link before I did this; after this it will be a non-broken symbolic link. So let's run Subtitle Editor again, and what do you know, we have the preview. Since most of this is going to be fine, I will open it up and basically scroll through. I really quickly read things and see what needs to be adjusted. I'm not even bothering reading everything right now; I'll do really minor fix-ups.
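The laptop-side steps in this part can be summarized as a short session like this. It's an illustrative sketch, not meant to run as-is; the repository path is reconstructed from the narration and may not be exact:

```
git pull                                              # or: git annex sync --content
ls -l python-for-scicomp-2023/raw/day3-obs.mkv        # broken symlink: content not here yet
git annex get python-for-scicomp-2023/raw/day3-obs.mkv
ls -l python-for-scicomp-2023/raw/day3-obs.mkv        # symlink now resolves into .git/annex/objects
```

The key point is that git pull only brings the symlink and metadata; git annex get is what actually transfers the file content from a remote that has it.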
This is actually more minor than I would normally do. Like you see, I'm just really scrolling through here. I mean, if it's normal text, it doesn't really matter if it's not perfect, so I will focus on things like where the commands are. OK, I'm going to scroll through and find a technical episode. You see, like here: this is Ctrl-Shift-B, that's readable; Ctrl-V might do something. This is all fine. Let's find where there are some file names. Look, this says "arc parse", but I know that should be argparse. This is probably happening a lot, so I will do a find and do some replacements. Replace, replace, replace. OK, good. And now I might have lost my place in the file, so maybe that wasn't worth it. OK, this one didn't get fixed. Let's scroll down. Let's find an example: "CMS", I wonder what that is. I have a shortcut: I bound the plus key so that if I push plus, it will play the video file at the location of the subtitle. OK, I can't figure out what that's supposed to be; I'll come back and look at this later, since I'm just demonstrating it for you. Things that are often wrong are the Unix command names, or dashes, things like that, or names of different modules and so on. Anyway, I guess you get the idea. I'm going to pause the video and we will return for the next part. OK, so the other part that we can do in parallel is the cutting. I will open this file. This YAML file defines some basic stuff about the workshop: a workshop description (let's make that more readable), the input files, and then for each input file, an output to be generated from it, a title and description, and the edit list. So we see: start at this time and end at this time. And next, for the intro, we see a similar output, title, description, and edit list. Here we have times of points in the videos, relative to the original.
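From the narration, the edit-list YAML has roughly this shape. This is a hypothetical sketch based on what is described on screen; the field names and times are illustrative and may not match the actual ffmpeg-editlist schema exactly:

```yaml
workshop_description: >
  Python for Scientific Computing 2023 ...

input: raw/day3-obs.mkv

- output: day3.1-icebreaker.mkv
  title: "Day 3.1: Icebreaker"
  description: >
    Icebreaker discussion before the lesson.
  editlist:
    - start: 00:04:12            # segment start, in raw-video time
    - 00:06:30: First question   # becomes a table-of-contents entry
    - end: 00:19:45              # segment end
```

One block per output video, each with its own title, description, and list of cut points and chapter marks, which is what makes the editing shareable as plain text in git.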
These will become table-of-contents entries, mapped into the output's time range relative to the start time. If we go down, we see some of them have breaks in the middle. So I will scroll down to day 3. Using Emacs, I will uncomment the first part. I will verify: yeah, this looks the same. Now I'm going to call it day 3.1. Actually, we want an intro, because we had an interesting icebreaker discussion today. So I will copy this icebreaker from day 1, bring it down, and insert it in day 3: day 3.1 icebreaker. There I gave a little description; I'll probably modify it a bit more after I stop. Then we've got the start and end points, so I remove these. I will make a new terminal here for mpv. mpv with --hr-seek allows me to seek more precisely, and not just to the keyframes: mpv --hr-seek=yes python-for-scicomp-2023/raw/day3-obs.mkv. So this starts up; I will make it a bit smaller. I can use the arrow keys to quickly scroll through. Oh, let's display the timestamps. And here we go. Notice it has the subtitles on here, so I can use that. I'll find the beginning. It will play; I can increase the speed some. OK, you can't hear what video is going on, but I can. So, OK, this seems like a good starting point. I have a key binding so I can copy the start time, and I paste it there. Keep scrolling. OK, that's too far. Here's where I want to start. So here and there, I've processed the icebreaker. For scripts, the number is incremented: it's day 3.2 now. I'll read and make sure this is basically the same. Yeah, seems OK: scripts. And here we go. And these pretty well match up with the sections of the lesson as we did them. So I get the start and end times and insert these timestamps. And with that, I will see you later once this is done. Here is an example of a fully done lesson: scripts, input, output, and all of these times are now placed here. OK, so I'm not done with everything yet.
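Mapping a raw-video timestamp into a table-of-contents time relative to the segment's start is just a subtraction of times. A minimal sketch of that arithmetic (my own illustration, not the tool's actual code):

```python
def to_relative(ts: str, start: str) -> str:
    """Convert an absolute raw-video timestamp to a time relative to the
    segment start; both arguments are H:MM:SS (or MM:SS) strings."""
    def parse(t: str) -> int:
        parts = [int(p) for p in t.split(":")]
        while len(parts) < 3:          # allow MM:SS by padding hours
            parts.insert(0, 0)
        h, m, s = parts
        return h * 3600 + m * 60 + s
    rel = parse(ts) - parse(start)
    return "%d:%02d:%02d" % (rel // 3600, rel % 3600 // 60, rel % 60)

# A chapter marked at 1:05:30 in the raw video, in a segment starting at 1:00:00,
# becomes a table-of-contents entry at 0:05:30 in the published video.
print(to_relative("1:05:30", "1:00:00"))
```

This is why the edit list can record everything in raw-video time: the relative chapter times fall out automatically once the start point is known.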
But I'm close enough, so I will start the encoding. I will minimize this. Git status shows that the file has been modified. I will git add it, git commit it with some message, and git annex sync --content. And while that's going, I will open another terminal and connect to my home computer. The reason I'm using my home computer is that the ffmpeg I saw on the cluster didn't work very well somehow, and my laptop's not powerful enough; my home computer definitely is. So notice here, this git annex sync --content is already copying the raw day 3 file to my home computer, named Ramanujan. I will cd to the video processing directory and wait for that to be done. I will git annex sync --content, and it's pushing and pulling from all the remotes it knows of, so from GitHub and from the cluster. python-for-scicomp-2023... and what do you know, the raw day 3 OBS file is already here. So what do I do now? I will do a tail or grep. I have a virtual environment to activate... no, that virtual environment isn't here; ffmpeg is just installed. So I will grep and find the ffmpeg-editlist command in my bash history and copy it. So what is it? It's ffmpeg-editlist, which is the program that does the cutting. This is the input YAML file; this is the input directory that has the video files; -o out means save to that directory; reencode means do an actual re-encoding, on eight cores. And then I will limit it to day 3, with the list added to it. So this is what will be encoded now. Let's do it; may as well run it under time. So there it goes. I will be back later when it's done, but in the meantime I will keep processing the day 3 edit lists for the last sessions. OK, so now I'm done with processing the last video. I will minimize. Git status, as usual. I will git add, commit, OK, and git push it. And the encoding is still going on; I see it's on lesson three now. And I will go back to subtitle editing.
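As far as it can be read from the narration, the encoding command looks roughly like this. This is an illustrative reconstruction, not a verified command line; the option spellings (apart from -o and the limit flag, which are mentioned) are guesses:

```
time ffmpeg-editlist editlist.yaml raw/ -o out/ --reencode -l day3
# plus an option selecting eight encoding cores, mentioned but not shown clearly
```

Running it under time is what later lets us see the 38 real minutes the high-quality encode took.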
So I will open this up, basically start at the top, and systematically go through and scan every line. Hopefully it doesn't take too long. I'm still processing subtitles, but I just saw that the encoding is done. So let's sync. I had already synced the new ffmpeg-editlist file there; I mean, I've already committed the new edit list. So this went up to dependencies. Oh, and it took 38 real minutes, and this is very high-quality encoding. Doing the listing, we see that these two lessons are left, so I tell it I want to re-encode these last two sessions of day 5 with the -l option. OK, it's going; I will be back when it is done. OK, the subtitle processing is done. I will git add all those changes, and let's push for good measure. So the last two lessons of the video are still processing, but while that's going on, let's do one more thing here. I will activate a virtual environment and change to here. OK, here we go. So I will run one of these commands. This is ffmpeg-editlist: the python-for-scicomp YAML file, this input directory, a given out directory. The -c option means check: it doesn't actually encode any videos, but it does recreate the info and subtitle files, and that's what we'll see. And we want to do it for day 3 with the -l option, and we want to make subtitles. So I run this, and note that all of this can happen even when the videos are not yet processed. If we look at git status, we see all of these day 3 things done. So now there's an info.txt file for every video, and this is what can be uploaded as the YouTube description: it has the video title, the workshop title, the video description, and then the generic workshop description at the bottom. And the subtitle file is, well, the subtitles. It used this raw subtitle file and extracted just the segment corresponding to that part.
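Extracting per-video subtitles from the raw SRT file amounts to keeping only the cues that fall inside a segment and shifting their timestamps by the segment's start time. A toy sketch of the timestamp shift (my own illustration, not ffmpeg-editlist's code):

```python
def shift_srt_time(t: str, offset_s: float) -> str:
    """Shift an SRT timestamp 'HH:MM:SS,mmm' earlier by offset_s seconds,
    clamping at zero."""
    h, m, rest = t.split(":")
    s, ms = rest.split(",")
    total = int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000 - offset_s
    total = max(total, 0.0)
    ms_out = int(round((total - int(total)) * 1000))
    sec = int(total)
    return "%02d:%02d:%02d,%03d" % (sec // 3600, sec % 3600 // 60, sec % 60, ms_out)

# A cue at 01:00:05,500 in the raw video, in a segment starting one hour in,
# lands at 00:00:05,500 in the published video's subtitle file.
print(shift_srt_time("01:00:05,500", 3600))
```

The same arithmetic that places the table-of-contents entries also re-bases the subtitle cues, which is why the check mode can regenerate both without touching the video.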
So with that being said, let's pause again and wait for the video processing to be done. Actually, I realize we can parallelize even more, because some of these videos are done, so let's begin processing those already. Let's see, I will connect to my home computer again. So I can git annex add all of the processed video files, but not the info.txt files, because those are done on my computer already. I could do git add instead of git annex add, and it would still know that these are large files that should be annexed instead of committed to git. OK, I hope git annex sync --content will not add new files. Yeah, so these files have been created, and it's copying them to the cluster. OK, with a brief pause in the recording, it is done. OK, let's move this out of the way and come back to this computer. So I may as well git add all the stuff in out/ now; it won't conflict with anything. I'll git annex sync --content. It pushes and pulls; it does a lot of syncing. If I list out/, we see that the day 3 files are still broken links, and that's because git-annex hasn't pulled them yet: the git-annex wanted content for this computer is just "present", which means it won't pull in stuff that's not requested. Git annex get out/, and it comes quickly because this is on the same network as the cluster. Notice how, as I'm using git-annex here, I'm not even really thinking about where stuff should be moved and where it should go. This is all defined in terms of these wanted-content expressions. So basically a git annex sync --content, and then a git annex get when necessary, moves everything exactly where I want it to be. So now I'm ready to start the YouTube uploading. OK, so here I am in YouTube. I push create, upload videos. If I come to this out directory, we see we can start with the MKV for the icebreaker: upload. I come here; I use less on the .info.txt file and copy it. Let's go over to the browser and paste. The first line is the title.
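The "wanted content" behavior described here is configured with git-annex preferred-content expressions. A sketch of the kind of settings involved; the remote name and expressions are illustrative, since the repository's actual configuration is not shown:

```
git annex wanted . present               # this machine: only keep what's already here
git annex wanted triton "include=raw/*"  # hypothetical: the cluster wants the raw videos
git annex sync --content                 # then moves content to match the expressions
```

Once the expressions are set, sync --content does the routing automatically, which is why no per-file decisions are needed during the workflow.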
And this is something which someone could automate someday. I select the playlist. Most of the rest of these options have been set in the channel defaults in YouTube somewhere. I set the recording date to today. Like I said, everything else is good with the defaults. I add subtitles: upload file with timing, and I find the 3.1 icebreaker subtitles.srt file. And I will play a little bit, just making sure that the timing is right. Yeah, it's good. You didn't hear the audio of the video, but that's fine. I click done, then next; it's checking; next; and then push publish. And so the videos appear. Oh, is it done? Yeah, it's done. Sometimes while this is going, it takes a while for the processing and checking to be done, in which case I wait for that to finish before I start uploading the next video. That's just to make sure the videos get uploaded and stay in the correct order, like reverse chronological order. Sometimes if I start uploading the next one, but it finishes before the previous one, then the videos will be in the wrong order. OK, with that being said, the encoding here is done. So git annex: let's add the remaining MKV files and sync, which includes copying to the cluster. OK, since I paused, that was pretty fast. There's one other thing we can do: I can git annex copy the output video files to Allas, and I think that happens from this computer. But I would have thought these were all already there; I wonder why it's copying again. Anyway, this is putting them on the CSC Allas object storage, and then other people can pull them. OK, it's done copying to Allas, and I realize what happened: it said it's copying, but they were really already there, so it's only copying the new things. And we could use a git-annex preferred-content expression so it would automatically copy stuff there. OK, let's see. So now that I'm here: git status. So I had it clean the text files there.
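The copy to object storage is a one-liner of this shape. The remote name here is a guess; it stands for whatever the Allas special remote is called in the repository:

```
git annex copy --to=allas out/*.mkv
```

git annex copy skips files the remote already has, which explains why the "copying" above turned out to transfer only the new videos.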
I will tell it to sync content, and come down here to my laptop. And with that said, I'm going to get these two last video files, but I'm going to tell it to get them from Allas. Here I haven't authenticated to the Allas object storage, so it's getting them from the public copy: it's downloading via HTTP, doing checksums and all that. It looks different, and that's because it's coming from a different remote. OK, so I will basically go and continue uploading the things to YouTube, and maybe that's all for this. I've processed all the videos with not that much effort overall. So now it's all done, uploaded, and configured. It is now a little bit less than four hours after the course ended, and three hours of course videos are uploaded, which is probably about two hours of actual content. So what did we learn and do here? The video came in, and I used git and git-annex to distribute it around. I used plain text files to configure the editing and distributed those around too, which allows git to share that data and also lets multiple people edit and work at the same time. So the encoding, the subtitle fixing, and the editing could all be happening at the same time; well, encoding after each part was already edited. And a lot of this could be automated: I could improve the Makefile some more, so that even more is automatic. Right now there's a fair amount that still isn't. So I think the system has promise. It's not perfect, because there's a lot of manual work to do still. But that's the main part: it allows us to publish these videos far faster and with less work than almost any other course or workshop I've seen. Thanks, and bye.