Welcome to another edition of RCE. Again, I'm your host, Brock Palen. You can find us online at rce-cast.com, where you can find a link to a list of all the shows that we've already finished. You can download the full MP3, subscribe by iTunes, RSS feed, and all that stuff. You can also nominate topics for future shows; we're always looking for topics coming in from the community and what people would like to see. Today I have a guest from, I believe, Sandia National Laboratories. He can correct me. He's a developer for a piece of software known as IceT, which I believe is a parallel image compositing engine. I know nothing about these things, so I'll let him introduce it, and he can correct me. Ken, why don't you go ahead and introduce yourself?

All right, well, like you said, my name's Ken Moreland from Sandia National Laboratories. You got that correct. My expertise is in parallel, distributed, and scalable graphics and visualization. I got my start in this field about 10 years ago, when we were working on IceT itself, and I've gone on to work on very large distributed systems. I am a regular developer, for example, of ParaView, a large scale visualization system.

Okay, so why don't you go ahead and tell us what IceT is?

All right, well, as you basically alluded to, IceT is a library to assist in large scale distributed memory parallel rendering. The name IceT is an acronym for Image Composition Engine for Tiles, and at its core is a set of very fast image composition algorithms. The basic idea is to take a collection of images, each with partial information, and combine them into one complete image. This has been a standard technique for parallel rendering for close to two decades now. What makes IceT truly unique is that we can perform this image composition on large format tile displays, and I know of no other system that is capable of doing that.

Okay, so this is one thing where I ran into trouble searching for this, because there happens to be a certain wrapper out there known as IcedTea. What is the official spelling of IceT? I know there's an entire paragraph on the IceT website about this.

Yeah, so it's a long and rather stupid story. Originally IceT wasn't called IceT; it was called MTIC, for Multi-Tile Image Compositor or something like that. And one of my colleagues came up to me and said, you've got to rename it because no one can remember that name. At the time, for some reason, I thought it had to be an acronym that actually made sense. So I thought very long and hard about something that was pronounceable, and I came up with ICE-T, for Image Composition Engine for Tiles. Originally it was all caps with a dash. Then for some reason I saw people using CamelCase, probably because you can't use a dash in identifiers in C code, so I switched to using the CamelCase IceT, but you still see both in practice. It doesn't Google very well, though. If you want to look for it, you should type my name in and then type IceT, and it'll show up.

Okay. So where exactly did the idea for IceT come from? Was it just the age of the existing algorithms out there that didn't map to newer systems?

No. What happened was this group began experimenting with parallel rendering in the late 1990s, and at the time, most of the large scale visualization was done on what we call big iron SGI machines.
These big, expensive boxes had lots of memory, lots of cores, and most important, lots of graphics pipes put together in a computer that was about the size of a refrigerator. But because they were expensive, there was a growing movement to use much cheaper consumer level graphics cards. And the best part was that the performance of these consumer level graphics cards was growing at an astonishing rate. So there was a plethora of research into software solutions for driving these distributed graphics cards.

Now, we were being funded by DOE ASCI, and ASCI was, and still is, about large scale computation to support science. So our goal was rendering copious amounts of geometry generated by these large scale computations. We did a lot of research that analyzed the behavior of the different parallel rendering approaches that were available, sort first and sort last, but we quickly concluded that the sort last algorithms were the way to go. The way the sort last algorithms work is that the rendering primitives are statically assigned to processes, and all the processes render a full-size image with their partial geometry. Then these images can be combined together to form a single correct image. The great thing about this is that it has wonderful scalability, which is what we were most concerned with, given the large amount of geometry that you want to render. Simply adding geometry amortizes the cost of the sort last parallel rendering algorithm. However, the biggest drawback, both back then and today, is that a bigger image means more overhead. So at the time, when we started to perform sort last on 60 million pixels, it was really a crazy idea.

But the idea we had behind IceT, that nobody else, well, very few people actually seemed to address, was that geometry is seldom uniformly distributed spatially. Geometry tends to exhibit spatial coherency, which simply means that if the geometry is close together in memory, it's probably also close together spatially. And when you have spatial coherence, that means when one process renders its geometry, it's going to render its data in a very small part of the screen. That's going to leave a lot of the pixels blank, and they can be ignored. On a tile display, that means a lot of the tiles can just be thrown away. So the algorithms in IceT are designed to throw away these blank tiles and balance the compositing of the remaining sparse set of tiles.

So I've seen the sort first, sort last stuff before in, what's the name of it, Chromium, which I believe is now a defunct project. But that was actually for GL accelerated rendering. Is IceT involved in the hardware acceleration pipeline, or is this before you even get to those calls?

It's actually after. IceT stands back and lets you render the geometry in whatever method is most applicable to your application. Most typically today that means using something like OpenGL to render on a GPU. After that rendering is finished, and remember, the rendering was done on a partial set of geometry, you grab those images back, and then IceT works generally at the CPU level, with the Message Passing Interface, to combine these images into a single cohesive unit that you can show to the user.

How does IceT know where things are placed in memory, then, if the geometry is already rendered? I guess I'm not quite following what's going on here.

Okay, the way a sort last algorithm works is that it doesn't really matter where the geometry is. Each process renders its geometry locally to an image. And for rendering an opaque surface, you'll render not only the colors that you'll see on the screen, but also a Z-buffer, which records, for each pixel, how far away the nearest polygon was. Then you read back the color and the depth, and you can, what we call, composite the images together: on a pixel by pixel basis, you look at the Z-buffer, you see which pixel is closest, and you color by that color. And of course, IceT's algorithms do this in a parallel manner, so they'll be doing this across many processes, passing messages back and forth.
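To make that pixel by pixel Z-buffer test concrete, here is a minimal C sketch of the compositing step Ken describes. The Image type and the composite_z function are hypothetical names for illustration only, not IceT's actual API:

    #include <stddef.h>

    /* A hypothetical image fragment: RGBA colors plus a Z-buffer, as
     * read back after a process renders its share of the geometry.
     * (IceT's real image type differs; this is only an illustration.) */
    typedef struct {
        unsigned char *rgba;   /* 4 bytes of color per pixel */
        float         *depth;  /* distance to nearest polygon per pixel */
        size_t         num_pixels;
    } Image;

    /* Composite 'incoming' into 'local': for every pixel, keep whichever
     * fragment is closer to the viewer.  Sort last compositing is this
     * test repeated, pairwise, across all of the processes' images. */
    void composite_z(Image *local, const Image *incoming)
    {
        for (size_t i = 0; i < local->num_pixels; i++) {
            if (incoming->depth[i] < local->depth[i]) {
                local->depth[i] = incoming->depth[i];
                for (int c = 0; c < 4; c++)
                    local->rgba[4 * i + c] = incoming->rgba[4 * i + c];
            }
        }
    }

The blank-tile optimizations Ken mentions come on top of this: pixels a process never touched can be skipped or compressed away before any messages are sent.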
So you mentioned that you work on ParaView, and actually ParaView, and a similar piece of software called VisIt, is where I first learned about IceT. You can use those in a tiled environment without IceT. Are they just using stuff that comes with VTK? You described what IceT does; what are they doing in these other modes that is so much slower than what IceT is doing?

All right, I can't speak to VisIt, but I can speak to ParaView. When you use ParaView now, it's always using IceT for its tile display rendering. There is a mode where IceT will just step behind the covers and ParaView will duplicate the geometry across everything to make things go faster. But in general, the only parallel rendering implemented in ParaView is IceT, and that's because it's the fastest algorithm we have available.

Okay, so what I was reading must have just been referring to IceT becoming standard, and people have stopped calling it out by name.

Yes. Because IceT has taken over all the parallel rendering in ParaView, oftentimes people don't even bother to mention it anymore.

Okay. So the next thing I had was the relationship between IceT, ParaView, and VisIt. I didn't know that you worked on ParaView before now. So of course ParaView is the primary consumer of IceT, but are there other products out there that use IceT?

So ParaView was certainly the first, and that's simply because we went from developing rendering algorithms like IceT to working on full systems like ParaView. When we started developing within ParaView, we were insistent on using efficient rendering libraries, and of course we liked IceT, because not only was it ours, but it's, again, the fastest one we had available. Since then, I know of a few other projects that have started to work with it. As you mentioned, VisIt has taken it in and has now enabled it for use on their tile displays. I also know of at least one other project from the French Atomic Energy Commission that's using IceT in some of their in-house visualization tools.

Okay, but IceT is something that's not really user-visible. Someone using one of these products would just see the compositing on a tile display go that much faster.

Yeah, that's correct. Likewise with, for example, OpenGL and MPI: the user really shouldn't be too concerned with the libraries underneath, just that the application they're directly interfacing with works.

So here's a question. Would you ever use IceT on a traditional compute cluster, like a headless cluster, something not driving a tile display?

Absolutely. As a matter of fact, we at Sandia use it that way all the time. As I mentioned before, at least in ParaView, IceT is the only parallel rendering algorithm at all. We use it not only for tile displays; we'll use it for what we call desktop delivery.
And that's when you have connected from your desktop to some remote rendering cluster, and it's generating images in parallel and sending them back to the client. These images are comparatively small; we're talking about a million pixels. And IceT is used for that compositing as well.

Okay, so would it also be involved in, like, a static rendering of an image? I mean, instead of sending it to the user's display, you just dump it to disk.

It could be used for that, yes.

Okay. So you already said that IceT comes into play after the GPU, so IceT does not depend on any GPU-specific Mesa or GL implementation?

Yeah, correct. The current API depends on OpenGL, but fairly superficially. As long as you have an OpenGL context, IceT will work fine, and X, a GPU, Mesa, all of these will provide an OpenGL context. Now, as I mentioned, this is kind of a superficial dependency, so in the future we're planning to relax this restriction. You'll probably still use OpenGL most of the time, because that's the easiest rendering library to use, but you could potentially use something different, for example, Manta as a ray tracing solution.

Okay, yeah, it would actually be interesting to put this behind a CPU ray tracer instead. So how complicated, actually, is IceT for someone who wants to implement it in their piece of software? If someone wanted to use IceT in a desktop graphical app that was rendering something, to take advantage of modern multi-core processors, so they wanted some sort of parallelism in their application, is IceT where they should start, or should they start with something else and then move to IceT?

Well, that depends on your situation. For example, as I mentioned, Manta is itself a multi-core rendering library, so if you're just dealing with one multi-core system, it wouldn't make sense to put IceT on top of that. IceT was designed to be distributed memory. It works just fine on an SMP machine; you just treat it as a distributed memory system, but only if your application is also treated in that manner. For example, you could run ParaView as an MPI job on an SMP multi-core machine. You would just treat it as a distributed memory system.

Now, how hard is it to integrate IceT into an application? It's typically pretty straightforward, because IceT doesn't enforce any type of data management or rendering style. As I mentioned, you can use whatever type of rendering you have available and send the images back into IceT. It does this through a rendering callback: whenever IceT needs an image, it just calls this rendering function that you provided and gets an image back.
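As a rough illustration of how that callback fits into a program, here is a skeleton of an IceT integration in C, sketched against the IceT 2-style API (icetCreateMPICommunicator, icetDrawCallback, icetDrawFrame, and so on). Header names and exact signatures have varied across IceT releases, so treat the details as assumptions to check against your version:

    #include <mpi.h>
    #include <IceT.h>
    #include <IceTMPI.h>

    /* IceT invokes this whenever it needs this process's partial image.
     * Render with whatever library you like (OpenGL, Manta, ...) and
     * deposit the pixels into 'result'. */
    static void draw(const IceTDouble *projection, const IceTDouble *modelview,
                     const IceTFloat *background, const IceTInt *readback_viewport,
                     IceTImage result)
    {
        (void)projection; (void)modelview;
        (void)background; (void)readback_viewport;
        /* ... render local geometry, then fill icetImageGetColorub(result)
           and icetImageGetDepthf(result) ... */
        (void)result;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        IceTCommunicator comm = icetCreateMPICommunicator(MPI_COMM_WORLD);
        IceTContext context = icetCreateContext(comm);

        icetSetColorFormat(ICET_IMAGE_COLOR_RGBA_UBYTE);
        icetSetDepthFormat(ICET_IMAGE_DEPTH_FLOAT);
        icetCompositeMode(ICET_COMPOSITE_MODE_Z_BUFFER); /* opaque geometry */

        icetResetTiles();
        icetAddTile(0, 0, 1920, 1080, 0);    /* one HD tile, shown by rank 0 */
        icetStrategy(ICET_STRATEGY_REDUCE);  /* one of IceT's strategies */
        icetDrawCallback(draw);

        IceTDouble identity[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
        IceTFloat black[4] = {0.0f, 0.0f, 0.0f, 1.0f};
        icetDrawFrame(identity, identity, black); /* render + composite */

        icetDestroyContext(context);
        icetDestroyMPICommunicator(comm);
        MPI_Finalize();
        return 0;
    }

Because the application only supplies the draw function, the same code drives a single window, a desktop-delivery session, or a full tile wall; only the icetAddTile calls change.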
Some people are probably wondering: you already said it relies on GL superficially, so you can run IceT using a software implementation of GL for rendering, right? Like if you were running it on a cluster without GPUs.

Yes, and we do that all the time. That's actually a pretty common use of some of these applications. Here at Sandia, we have several clusters that weren't specifically designed to be visualization clusters, so they don't have GPU hardware. And we do exactly what you said: we use the Mesa software rendering library, and IceT works just fine with that.

So you mentioned IceT is really focused on between boxes, distributed memory. Are there many applications outside traditional HPC where you see IceT as being useful?

IceT is well suited for any application which deals with large meshes that are rendered and where detail matters. And if you're talking about a really large-scale system, we're going to be talking about distributed memory anyway, so this is where IceT fits very well. Other possible application areas I can think of: for example, oil and gas has large geological surveys along with simulations, and they could benefit. In fact, I've seen some examples of oil and gas actually using IceT. I can imagine things like weather and climate studies, or medical applications. Like I said, pretty much anywhere that you're dealing with large data and the detail matters.

So right now, something like a regular HD movie rendering, those are still small enough images per frame that IceT wouldn't be beneficial for them?

Sure, it could be beneficial. As I said before, we use IceT for parallel rendering of smaller images, images that you send to your client, and an HDTV signal is just fine. You could render that as a single image, or you could break it up into tiles if you want; you have that choice with IceT. We've found that IceT is very efficient even on smaller images. It's rather interesting: even though we specifically designed IceT for tile displays, we find that we don't often use it for tile displays. We more often use it for a single display, which we call a single tile display.

So you found that you just made something that worked well in the large case, and it also worked well in the small case.

Yeah, the optimizations that we did in IceT to basically get rid of all the unused pixels work just as well when you're dealing with a single image as they do when you're dealing with multiple tiles.

So compared to other compositing methods, just how much faster is IceT over some well-known methods?

Well, as far as compositing is concerned, if you're dealing with small images, IceT will revert to a binary swap algorithm, which is a pretty common technique. But nevertheless, IceT's implementation is efficient and makes good use of these data reduction techniques that I mentioned. If you're talking about a tile display, those algorithms are unique to IceT, and they're less comparable. Compared to a naive approach using a sort last technique, you'll get a much greater speedup, like about 10 times, possibly. Now, that said, there are techniques other than sort last that can be used on a tile display. For example, you mentioned Chromium, and its predecessor, WireGL, which had a very good sort first implementation for tile displays. That could get much, much faster frame rates than IceT, particularly when you have really big images. But with that same approach, if you had a lot of geometry, they would slow to a crawl. IceT can handle much larger geometry than those techniques could. And this is why I specifically stated that IceT is good if you have applications with lots of detail, where the detail matters.
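For readers unfamiliar with binary swap: after each process renders its full-size image, processes pair up over log2(P) rounds, trading and compositing halves of their current piece until each owns a fraction of the final picture. Here is a depth-only C/MPI sketch of that pattern; it assumes a power-of-two process count and a pixel count divisible by it, and it leaves out the colors, compression, and tile handling a real implementation such as IceT's would need:

    #include <stdlib.h>
    #include <mpi.h>

    /* Depth-only binary swap over P = 2^k ranks.  Each round, partners
     * split their active piece in half, trade opposite halves, and
     * composite what they receive; after log2(P) rounds every rank owns
     * 1/P of the final image. */
    void binary_swap_depth(float *depth, int pixels, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        float *recv = malloc((size_t)pixels / 2 * sizeof(float));
        int offset = 0, span = pixels;

        for (int stride = 1; stride < size; stride *= 2) {
            int partner = rank ^ stride;   /* pair up across the stride */
            int half = span / 2;
            /* The higher rank of each pair keeps the upper half. */
            int keep = offset + ((rank & stride) ? half : 0);
            int send = offset + ((rank & stride) ? 0 : half);

            MPI_Sendrecv(depth + send, half, MPI_FLOAT, partner, 0,
                         recv, half, MPI_FLOAT, partner, 0,
                         comm, MPI_STATUS_IGNORE);

            for (int i = 0; i < half; i++)      /* per-pixel Z test;    */
                if (recv[i] < depth[keep + i])  /* with colors you would */
                    depth[keep + i] = recv[i];  /* also copy the RGBA   */

            offset = keep;
            span = half;
        }
        /* This rank now holds final pixels [offset, offset + span). */
        free(recv);
    }

A final gather of the per-rank pieces to the display process completes the frame.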
So, and this is a question Jeff had, how well does MPI fit your communication patterns when working with IceT?

Message passing is a good fit for IceT, and for image compositing algorithms in general. IceT uses MPI for all communication. Technically, it's possible to swap in a different communication library, but nobody ever does.

And what kind of hardware should someone be running this across? Are the types of messages that IceT relies on short and latency sensitive, or are they large and seldom, so they're bandwidth sensitive?

Well, you could say that they're both. IceT tries to pass large messages, so you do want high bandwidth. But latency matters too, because remember that we're talking about parallel rendering, and if you have an interactive application, you want this done in fractions of a second. So if you have a large latency on your interconnect, you're not going to get very fast frame rates.

So you wouldn't really want to run this on a traditional, like, ethernet cluster; you'd want some sort of high performance interconnect?

Right, your performance is going to be very bad on traditional ethernet. You want a high speed interconnect.

You were distinguishing between large geometry versus a large image. Could you expand on that a little more so it's a little clearer? I'm not very familiar with that.

Okay, so large geometry is the input of what you're rendering. You can imagine you're, say, rendering a building. You could represent the building with very small geometry, which would be a square for each wall and a square for the roof. Or you could have a separate square for each one of the bricks in the building, and have thousands of these bricks that you're all rendering at a time. That's the difference in the geometry size of the input to IceT. The image size is fairly straightforward: it's the total count of pixels on your display. So you could have, as you mentioned, your HDTV image, which is 1920 by 1080, or you could tile these up, so you could have 16 of those and add up the pixels. IceT was specifically designed to handle large in both senses, both large input geometries and large displays. It's a very difficult and computationally intensive problem.

So what, then, is the largest, well, I don't know exactly what the metric for geometry would be. Would that be the number of vertices or something like that?

Usually you count polygons; the number of vertices is fairly comparable.

Okay. So what's the largest geometry ever rendered using IceT, versus the largest image ever rendered with IceT?

It's been so long since I've actually counted geometry. I know back five, seven years ago, we were talking on the order of half a billion polygons, and that actually seems pretty small today. I know we've rendered geometry that comes from over a trillion cells, but we ran isosurfaces on those, so I think the actual number of polygons we rendered was smaller than that, but it's pretty close: on the order of billions. As far as the display is concerned, I'm not exactly sure what the largest image ever created was. I can tell you the ones that we used to create were around 63 million pixels, because we had a very large display that had 63 million pixels. It was huge; it was 15 feet across. Even today that seems a little bit excessive, because you kind of had to walk back and forth just to see the entire display. I don't know if anyone has ever rendered anything larger than that. It's possible, but there's no real limit on the image size that comes out of IceT, because IceT is careful with its memory allocation not to hold more than, say, two or three tiles' worth of image data on any process at any one time. So as long as you have enough computers to drive the tiles, IceT should be able to generate the image.
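To put those pixel counts in perspective, here is a little back-of-the-envelope arithmetic in C. The 8 bytes per pixel is an assumption (4-byte RGBA color plus a 4-byte float depth), typical for Z-buffer compositing:

    #include <stdio.h>

    /* Rough sizes for the displays discussed above. */
    int main(void)
    {
        const double bytes_per_pixel = 4 + 4;  /* RGBA + float depth */

        long hd   = 1920L * 1080L;  /* one HDTV tile: ~2.1 Mpixel */
        long wall = 16 * hd;        /* 16 such tiles: ~33 Mpixel  */

        printf("HD tile: %ld pixels, %.1f MB of color+depth\n",
               hd, hd * bytes_per_pixel / 1e6);
        printf("16-tile wall: %ld pixels, %.1f MB\n",
               wall, wall * bytes_per_pixel / 1e6);
        /* A 63 Mpixel wall like the one described is roughly 500 MB of
         * raw color+depth per full frame, which is why holding only two
         * or three tiles per process matters. */
        return 0;
    }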
So back to comparing IceT to Chromium: Chromium sat behind the application and intercepted the actual GL draw calls, if I understood it correctly, whereas IceT goes in the actual application you're running. What would you tell somebody who has an existing GL visualization app and wants to run it on a display wall?

Right, well, if they have a serial application, something like Chromium is perfect for that, because we're most likely not talking about huge amounts of geometry. So the Chromium approach, the sort first approach, is going to be much more efficient than the IceT approach. Now, if they were complaining that they couldn't render very fast even on their local display, because they had so much geometry overloading their application, then they're going to have to parallelize their application. And regardless of whether they used IceT or Chromium, that's probably going to be the hardest part of the development. Inserting a parallel rendering library on top of it is going to be fairly straightforward. That goes for using Chromium in a fully parallel manner, too: Chromium, I believe we mentioned, also had a sort last algorithm in there, although that sort last algorithm doesn't work on tile displays. If you were to use that, you would actually have to make modifications to your application to tell Chromium how to do certain things, and to drive Chromium in a parallel manner. So regardless, you're going to have to make changes to make your basic application and data handling parallel, and then putting either IceT or Chromium on top of that is about the same amount of effort.

So where do you see the future of these scientific scale visualization applications going? Are more and more of them going the route where all the rendering and compositing technology is in the app and not in the system?

I'm not sure I understood the distinction between app and system.

System level would be like Chromium, where you run the app and then the system takes care of splitting up the geometry and sending it to different tiles, or breaking up the rendering between them.

I think for the most part the parallel rendering algorithms aren't going to change dramatically. They were, for the most part, originally designed 10 years ago. There have been tweaks here and there, but they haven't changed all that much, and I don't expect them to change dramatically either. Now, whether you go for system level or app level, I guess that depends on what the needs of your application are. For IceT, going into the system level isn't really necessary, because all of your parallel operations happen after the system level is finished. Whereas for the Chromium approach, it's really convenient to be able to do the sort first and intercept things at the system level. But as you scale up higher and higher, particularly with more and more input data, that approach simply doesn't scale with your input data, so it's simply not going to be feasible.

So since the methods aren't changing, if someone has an existing parallel app that kicks out massive amounts of data and they want to scale up their visualization, should they just take something like VisIt or ParaView, which are extensible parallel systems already, and just write a data reader for their format?

That would probably be the most straightforward way of doing it, yes.

Okay. So what's coming in the future for IceT?

Well, the most exciting upcoming work is work that I'm doing with Tom Peterka and Wes Kendall from Argonne National Laboratory. They have a recent sort last compositing algorithm called radix-k that is supposed to outperform existing sort last compositing like binary swap, particularly when you have a whole lot of cores. So we're going to integrate the radix-k algorithm into IceT, compare it to the existing algorithms, and then measure IceT's performance at up to 100,000 cores.
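Radix-k isn't described in detail here, but the published idea generalizes binary swap: instead of always pairing processes, each round partitions them into groups of size k, with binary swap being the case where every k is 2 and direct send the single-round case where k equals the process count. Here is a conceptual C sketch of just the round structure, not the Argonne implementation; the target radix of 8 is an arbitrary choice for illustration:

    #include <stdio.h>

    int main(void)
    {
        int P = 64;     /* process count to factor into rounds */
        int factors[32];
        int rounds = 0;

        while (P > 1) {
            int k = (P < 8) ? P : 8;         /* target radix (arbitrary)  */
            while (k > 1 && P % k != 0) k--; /* largest radix <= 8 that   */
                                             /* divides P                 */
            if (k == 1) k = P;               /* prime P: one direct-send- */
                                             /* like round                */
            factors[rounds++] = k;
            P /= k;
        }

        /* In round i, groups of factors[i] processes split the current
         * image piece factors[i] ways, exchange the pieces, and
         * composite.  Every factor of 2 reproduces a binary swap round. */
        printf("radix-k rounds:");
        for (int i = 0; i < rounds; i++) printf(" k=%d", factors[i]);
        printf("\n");   /* for P = 64 this prints: k=8 k=8 */
        return 0;
    }

The appeal at very high core counts is that larger factors keep more messages in flight per round, which can hide latency on interconnects that overlap communication well.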
People are still trying to use Chromium because they have something and they just want to blow it up on a wall, and there's no nice way to do that currently. SAGE, I believe, is similar, but I've not investigated SAGE enough to know if that's going to work well.

Yeah, I don't know a lot about SAGE, but I have gotten several emails from one of the developers, because he is integrating SAGE with ParaView and the IceT code in ParaView. So IceT is doing the multi-tile compositing, and they're taking SAGE and streaming the images to a remote tile display. It's kind of interesting work.

Yeah, it seems like everything now has to be written for a tile display to really take advantage of one. Chromium was a nice stopgap, but nobody is working on it, so it's kind of dead.

Yeah, like I said, with Chromium, and particularly its predecessor WireGL, the idea of taking your serial application and just blowing it up on a tile display was a really cool idea. I mean, it worked amazingly well. But if your application is growing, particularly for scientific visualization where you have lots of data, that approach doesn't work. You have to parallelize your original input, and then suddenly trying to parallelize OpenGL streams at the system level makes less sense.

Well, Ken, thanks a lot. What's the website for IceT, and contact information?

All right, the IceT website is www.cs.unm.edu/~kmorel/IceT, and the contact person would be me, kmorel@sandia.gov.

Okay, Ken, thank you very much for your time.

All right, thank you.