Which is on its version 5. So thank you very much, Luca, for being here, and please go ahead.

Yeah, thanks Marco, and thanks for having me. Hi everyone, I'm Luca, and today I will be talking about the Open Vision Computer, specifically its fifth iteration. The Open Vision Computer is a project that has been going on for quite some time; I included the list of people who have worked on it over the years, and we are currently on iteration 5. I'll give a brief overview of what the platform is and of its history before getting a bit deeper into the current version. So what's the motivation? Why did we build an open source computer vision system in the first place? Sometimes off-the-shelf vision systems are great: if all you want is, say, stereo vision and a very simple API to get some information out of the camera, you can just buy an off-the-shelf camera. But in some cases off the shelf doesn't work, and that's mostly when you have a very specific application, where maybe you have very tight size, weight and power constraints, very tight latency or computational constraints, or even hardware constraints, in case you want to integrate some specific hardware. So a long time ago we embarked on this quest to create a computer vision system that is open source from the hardware, firmware and software points of view. The project has been going on for about seven years by now, and these are all the applications. In the first five years we focused on the FLA program.
You can see here one of our Open Vision Computers flying on a drone, in collaboration with the University of Pennsylvania. In this case the constraints were good global-shutter images and a very small and light vision system. In the last two years we moved in a different direction, looking into mobile manipulator sensor heads: since we are working with a ground robot, we are not really constrained in terms of size or weight, so in this case we just wanted flexibility and expandability of the platform.

So, the beginning, the FLA era, OVC 1 through OVC 3. The purpose of the FLA program, Fast Lightweight Autonomy, is to fly small and light drones quickly through cluttered spaces and to do autonomous navigation. We went through three iterations. The first two were based on the NVIDIA TX2: they were FPGA boards that you connect to the TX2. But then we decided that it's not great to be bound so specifically to the TX2 platform, because NVIDIA could, for example, release a new platform with a very different hardware connection or even different software support, which did happen over the years, for example with the Xavier or the later Jetson platforms. So with OVC 3 we changed direction, and that's the direction we have today: OVC is, at the end of the day, just a USB device that you connect to your computer, and most of our OVC platforms are FPGA based. The idea, especially in the case of the FLA program, is to offload as much computation as possible to the camera, to help downstream users of the vision system do whatever they need to do. So in this case we actually developed an open source feature detection algorithm running on the FPGA, which runs on the pixels as they come from the camera, in real time, so you have basically zero latency. It's the FAST corner detector; I won't go too deep into it, but basically you look at a circle of pixels around every candidate pixel and check whether it looks like a corner.
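To give an idea of what that corner test looks like, here is a minimal, non-streaming sketch of a FAST-style segment test in plain Python. The circle offsets are the standard 16-point Bresenham circle; the threshold and arc length are arbitrary example values, not the parameters used on the OVC, and the FPGA version of course works on pixels as they stream in rather than on a stored image.

```python
import numpy as np

# Offsets of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_fast_corner(img, y, x, threshold=20, arc_len=9):
    """FAST-style segment test: pixel (y, x) is a corner if at least
    `arc_len` contiguous pixels on the circle are all brighter than
    center+threshold or all darker than center-threshold."""
    center = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    brighter = [p > center + threshold for p in ring]
    darker = [p < center - threshold for p in ring]
    # Duplicate the ring so contiguous runs that wrap around are handled.
    for states in (brighter, darker):
        run = 0
        for s in states + states:
            run = run + 1 if s else 0
            if run >= arc_len:
                return True
    return False

# Example: scan a grayscale image (3-pixel border skipped).
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
corners = [(y, x) for y in range(3, 61) for x in range(3, 61)
           if is_fast_corner(img, y, x)]
```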
So what happens is that, in real time, you get as output both the images and the features in those images, which are very useful, for example, for SLAM, for localization and mapping applications. And since you have both the features and the images, you can also combine them with more off-the-shelf algorithms: for example, you can use a GPU implementation to run some object detection or segmentation. So this was the typical application: some computation on the FPGA and some computation on, for example, an NVIDIA platform. I won't go too much into detail, but this was OVC 1, where we have synchronized cameras and an output of images and features, and this is an example of autonomous flight using our vision system together with the GPU: they run stereo matching on the GPU, plus localization and mapping, and then you can build a map and fly autonomously. Then, as I said, we moved to OVC 3. The idea with OVC 3 is that we no longer attach directly to any NVIDIA platform; we are just a USB-C device. So regardless of what machine people are running, they can connect to our camera over USB, get synchronized images, and then do whatever they want downstream. Again, this was used with the GRASP Lab at the University of Pennsylvania. This is an example of the data stream coming from our sensor, where they are again doing localization and mapping based on the cameras to navigate autonomously in a very tricky environment, because it's a forest, so there's a lot of repetition.
It's not an easy place to navigate around, and they do tree detection to localize, to avoid the trees, and to build maps. Anyway, after OVC 3 we moved into the new era of OVC 4. In this case we are not really constrained by size, weight and power as you would usually be on a flying platform, because this is a ground vehicle. So now the idea is to have a very flexible computer vision system where you can connect whatever image sensors you want and it will just integrate all of them, and we also want to achieve low-latency vision, because low latency is critical for fast-acting robots and for effective control loops. So we had three targets: ease of use, high bandwidth, and low latency.

For ease of use we decided to go with the Jetson Xavier NX platform. The idea is that programming GPUs is a lot easier than programming FPGAs, because there is a lot of support out there for GPU algorithms, and because you work in something like C or C++ rather than a hardware description language. The Jetson platform also has MIPI inputs, so you can feed camera data straight into it. So we developed OVC 4, which is a custom Xavier NX carrier board with a high-end MCU, a microcontroller. The microcontroller is used for the hard real-time tasks, for example sensor synchronization, while the Xavier NX is used to fetch the images, maybe run some computer vision algorithms, and return them to user space, to the user. We were also very interested in high bandwidth, so we looked into 10 gigabit Ethernet, and we could go all the way to about 9.5 gigabits per second of effective bandwidth. But that requires a 10 gigabit Ethernet port, which is not very common, so we decided to switch to something simpler and just do a single USB 3 connection that you can plug into your machine, which gives us about a quarter of that bandwidth, around 2.5 gigabits per second. The low latency target is the part that just didn't go very well. The problem is that, specifically on the NVIDIA platform, you need to use their proprietary API to fetch images from the sensors, and you can't get away from the frame buffering that it does: there is a three-to-four-frame buffer inside the Jetson, so every time you read a frame, you are actually reading a frame that was captured three frames in the past. That's okay if you have a very high frame rate sensor, but since we were running at 15 hertz this resulted in almost 300 milliseconds of latency, which was totally unacceptable.
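To spell out where that number comes from: the extra latency of a fixed frame buffer is just the buffer depth times the frame period, so a quick back-of-the-envelope check with the figures above looks like this.

```python
def buffering_latency_ms(frame_rate_hz, buffered_frames):
    """Latency added by a fixed frame buffer: depth x frame period."""
    frame_period_ms = 1000.0 / frame_rate_hz
    return buffered_frames * frame_period_ms

# OVC 4 on the Xavier NX: 15 Hz sensors behind a 3-4 frame buffer.
print(buffering_latency_ms(15, 3))   # 200.0 ms
print(buffering_latency_ms(15, 4))   # ~266.7 ms, i.e. "almost 300 ms"

# The same buffer depth at a high frame rate would hurt far less.
print(buffering_latency_ms(120, 4))  # ~33.3 ms
```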
So we decided to like change Change approach So we went for obviously five Which is a mix of a mix of the two and which is a like a FPGA based So it's based on the zinc platforms and from silings and it's um, and it's a also like a fully configurable platform So we we we managed to about double the bandwidth and then the approach is that right now We still do usb, uh, but it's a double usb Double like five gigabit connection to an onboard usb hub Which then exposes a single like 10 gigabit connection to the user So the user just needs to to have a 10 gigabit, uh, usb port Connect one wire and then they will get like up to 6.5 to 7 gigabit per second data from the from the from all the images that obviously five is running And then for the for the latency, it's um It's even more interesting because now we have full control over what's happening because we have very low level hardware control Uh, so we we can go all the way to not even have a full frame buffer But like send image while it's still coming from the sensors So we managed in this way we managed to reduce the latency back by 10 So so now the sensor to user space latency was about like, you know 27 milliseconds um And then again, like it's it's a fully expandable platform. So we we provide like six connectors and they follow the somewhat standard raspberry pi camera pinout um So so the idea that users can like plug whatever configuration of camera they want they can like synchronize them integrate them And output them over a single over a single usb connection Um, so what is what is the architecture like? So I won't go too much into into detail But basically we're using this this xilinx chip Which is uh, which is puts together a multi core processor in this case It's an arm cortex processor together with fpga Together with a programmable logic with an fpga and the two are like very highly coupled So you have like you can have interrupts going between the two or you you have like a high speed interconnect to external memory to for example share the images and the the interesting part is that xilinx in general provides up quite a lot of like open source Uh, ip's which are like, you know boxes of vhd code that do the most common Common operations you might want to do So actually the interesting part is that thanks to all this free not open source, but free free for use ip's provides actually there is not a single line of vhdl or very That was written to to have obviously five up and running. So it's all pure Linux like c++ or Python work really Um, so okay now but like the question is like, you know, how does it actually work? Um Again, I won't I won't go too much into this. This is just to to give a To give an idea of like how of like the complexity of the of the platform So there are a lot of like steps happening because again this system on a chip is a very complex complex system But but again, like, you know, the the tool chain takes care of most of the work for you So you just it's one of those cases where you don't really need to do a lot of work But you need to know exactly where to where to go and do it and do it So for example, we we had like we had our own custom FPGA logic FPGA logic and then a tiny bit of customization on the linux kernel Kernel side and the rest is just everything is done from a user space that young So you you don't need to have like too much knowledge of very low level hardware to work with this So how does it work? 
So how does it work? At a very high level, we can run up to six imagers in parallel. Every imager uses an I2C interface for configuration and outputs its data through a MIPI interface. The data is then received, decoded, and sent to a DMA, a block that writes it into shared memory, so the data ends up in external memory. Communication with the ARM processor happens through interrupts: whenever a DMA has finished writing a whole frame, it sends an interrupt to the processor saying, okay, I finished a frame, now you can read it from this area of memory. For the low-latency pipeline I mentioned before, you can instead configure it to notify the processor after only part of a frame has been received: okay, I received half a frame, now you can start sending data as fast as possible. That, too, is communicated to the processor through an interrupt.

Now a very high-level overview of the software. Since we want a very configurable platform, we actually detect the imagers at runtime: every sensor has its own I2C address, and usually a register with a special identifying value, so we use those at boot time to check which sensors are connected. We initialize all of them, and then, since we have the two parallel USB connections, we assign each imager to the USB connection with the most available bandwidth. Then, depending on what the user configured, we either do a full frame buffer, the conventional loop where you wait for an image, send the image, and once all the images are done you wait for images again, or we do the very low-latency mode, where you wait for a certain number of lines from one specific imager and then overlap receiving and transmitting for all the imagers you have. It depends on whether you want ease of use or extreme low latency. To compare the two approaches: with extreme low latency, you receive a few lines from one imager and then trigger transmission for all the imagers. This is great because you get almost the lowest possible latency, much lower, I would say, than any commercial system; but if you don't really know what you are doing, you risk receiving corrupted frames, because you must be sure that when you are sending data you are sending the right data, not half a frame from the past and half a frame from the future. If instead you want a more conventional, simple application, you can just wait for a full image to be received and then send it over. This is a lot more flexible, because you don't need to know exactly how every sensor works; you just send all the data once the full image has been received, so you don't risk corrupting frames. It is a bit higher latency, because now you buffer a full frame, but that is still nothing compared to what the Xavier platform had, which was a three-to-four-frame latency, so it's reduced by a factor of three to four.
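As a concrete picture of the difference between the two modes, here is a small control-flow sketch. It is not the OVC 5 firmware; the interrupt and USB helpers are trivial stand-ins so the sketch runs, and on the real device they would block on the DMA interrupts and push data out over the USB links.

```python
def wait_for_frame_interrupt(cam):
    # Real version: block until the DMA has written cam's full frame.
    return f"<full frame from {cam}>"

def wait_for_lines_interrupt(cam, n_lines):
    # Real version: block until the first n_lines rows of cam have landed.
    return f"<first {n_lines} lines from {cam}>"

def send_over_usb(data):
    print("tx:", data)

def full_frame_mode(imagers):
    """Easy mode: buffer complete frames, then transmit. No corruption
    risk, but you pay roughly one extra frame of latency."""
    frames = [wait_for_frame_interrupt(cam) for cam in imagers]
    for frame in frames:
        send_over_usb(frame)

def low_latency_mode(imagers, trigger_cam, trigger_lines=64):
    """Low-latency mode: once a few lines of one imager have arrived,
    start streaming every imager out while the rest of the frame is
    still coming in. The transmit side must never overtake the DMA
    write pointer, or a frame can mix old and new data."""
    wait_for_lines_interrupt(trigger_cam, trigger_lines)
    for cam in imagers:
        send_over_usb(f"<line-by-line stream from {cam}>")

cams = ["cam0", "cam1", "cam2"]
full_frame_mode(cams)
low_latency_mode(cams, trigger_cam="cam0")
```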
I also wanted to give a bit of an acknowledgement, because this wouldn't have been possible without a retired Japanese FPGA engineer who wrote the kernel module that lets users map the DMA data into user space. There is a very relevant xkcd here, because it's one person who has been thanklessly maintaining this single piece of code, and if you look at the issues on GitHub, a lot of people are really relying on what this person has been doing in their free time. So if you're out there listening, thanks a lot.

Now let me show what we have actually done with this platform. First, a bit of introduction: the image sensor world is traditionally a very closed-source world. We work with a lot of different imagers. The simplest one is the Raspberry Pi camera that you see here on the left, the Pi Camera v2 I believe it's called, and this is the one we provide as an open source reference implementation, because there are a lot of existing implementations for this camera; we took a mix of different GPL and BSD licensed implementations and put one together. Then we experimented with a lot of other sensors: a two-megapixel global shutter sensor, which is very good for fast-moving applications like drones; a five-megapixel HDR sensor, which is good when you need high dynamic range, for example when you have a very dark and a very bright part of the scene at the same time; and the IMX490, an even better HDR sensor designed for automotive applications. Sadly, because the image sensor market is generally very conservative, all of this work was done under NDA, so it cannot be shared, but we still provide the open source reference implementation for the Raspberry Pi camera that people can use as a starting point for their own sensors. Finally, together with all the image sensors, we also looked at time of flight. Time of flight is basically a way of using a laser to calculate how far away objects are: you shine a laser and then calculate the distance from how long it takes for the light to come back. Et voilà.
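For reference, the time-of-flight relation is just distance = speed of light × round-trip time / 2, so the times involved are on the order of nanoseconds. A quick illustration with a made-up timing value; many depth sensors actually measure the phase shift of modulated light rather than a raw pulse time, but the distance relation is the same idea.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance_m(round_trip_time_s):
    """Distance from a time-of-flight measurement: the light travels to
    the object and back, so divide the round-trip path by two."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# Example: a 20 ns round trip corresponds to about 3 meters.
print(tof_distance_m(20e-9))   # ~3.0 m
```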
So we have like the two hdr sensors the time of flight sensor and the three Global shutter sensor So you can see, for example, like global shutters have some problem when there's a lot of like bright and dark in the image But it's not an issue for like hdr sensors. So the idea of the heterogeneous set of sensors So, yeah, uh lessons learned is that yeah, again, like you know every application is very different as they're very different needs So it's it's good to have to have flexibility to like be able to plug whatever sense of complication you have and like do whatever you want with the images and also You need to have like full control of the of the data flow to have low latency And again, like, you know, these days fabrication has probably most of hardware people know is very hard. So it's like um The lead time is actually making like very hard to like keep developing on this platform at least for the past one. So, yeah So, yeah, we are we are looking in the future for like, you know Since since since we have hardware controllable the pictures we are looking at For like having hardware based Privacy so for example for like, you know, detection or like people detection Uh, or like pedestrian detection and so on so like to to make sure that Since it's a fully open source platform And like everyone knows what's happening at every point Then it's like a fully guaranteed to like ensure the privacy of the of the people That are captured by the camera Um, so yeah, if you want to have a look you can go to the link over here open vision computer and yeah, there it is Thanks all of you