Thanks for the introduction, Kevin. For those who work in astronomy, think of me as a computer scientist; for those who work in computer science, I work on astronomy. Take that. So today I'll be talking about the Kira project, which is an astronomy image processing pipeline we built with Apache Spark and Hadoop. This is a collaboration with my colleague Kyle Barbary, with Frank Nothaft and Evan Sparks from computer science, and with Oliver Zahn, Dave Patterson, Michael Franklin, who is my advisor, and Saul Perlmutter, who is the director here.

People have been counting the stars since a very long time ago, and year after year we have adopted more advanced telescopes, replacing rulers and pencils with supercomputers to record observations. If you look at the upcoming LSST sky survey, here are its requirements: a sustained throughput of roughly 330 megabytes per second, and a less-than-60-second latency requirement for processing real-time transient event alerts.

In a typical supernova detection pipeline, we take images, extract sources, estimate point spread functions, then do reprojection and co-addition, run extraction again, and, with classification, produce an astronomical object catalog. Visually, source extraction in particular takes images as input and outputs a list of astronomical objects with a set of statistical parameters. We combine the SEP library, a source extraction Python library authored by my colleague Kyle Barbary, with Spark, put it all together, and, as a computer scientist, I show that it runs faster.

First, on the cloud, we compared the Spark-plus-HDFS solution to a conventional HPC software stack: we ran GlusterFS across the same nodes and used a parallel scripting approach for parallelization, because this application is embarrassingly parallel. From the performance measurements, the Spark solution, Kira SE, runs about four to five times faster on the cloud than the conventional stack. The driving reason is that this application is data-intensive: a profiling study shows that 80% of the time goes to reading and writing files, and only a small fraction is consumed by computation. Since it's data-intensive, we also tried solid-state disks on the cluster and observed a further performance improvement by a factor of two.

Then we compared Kira's performance on the cloud with an equivalent C implementation on an existing supercomputer. Kira is about 1.8 times faster than the supercomputer, and we observe much more stable performance for Kira on the cloud. This is again due to the nature of the application: it's I/O-intensive, and the high variance in the supercomputer's performance comes from its I/O architecture, because the I/O network is shared among users.
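(Editor's note: to make the source-extraction stage concrete, here is a minimal sketch of the idea, assuming PySpark plus Kyle Barbary's SEP library. The HDFS path, detection threshold, and function name are illustrative placeholders, not the project's actual code.)

```python
# Minimal sketch: embarrassingly parallel source extraction with SEP on Spark.
# The HDFS path and the 1.5-sigma threshold are illustrative placeholders.
import io

import numpy as np
import sep
from astropy.io import fits
from pyspark import SparkContext

def extract_sources(path, raw):
    """Run SEP background estimation and source extraction on one FITS image."""
    with fits.open(io.BytesIO(raw)) as hdul:
        # FITS stores big-endian data; SEP expects native-endian floats.
        data = hdul[0].data.astype(np.float64)
    bkg = sep.Background(data)                  # spatially varying background estimate
    objects = sep.extract(data - bkg, 1.5, err=bkg.globalrms)
    return path, len(objects)                   # objects: structured array (x, y, flux, ...)

sc = SparkContext(appName="kira-se-sketch")
counts = (sc.binaryFiles("hdfs:///images/*.fits")   # RDD of (path, raw file bytes)
            .map(lambda kv: extract_sources(*kv))
            .collect())
for path, n in counts:
    print(path, n)
```

Because each image is processed independently, Spark can schedule the map tasks on the nodes that hold the HDFS blocks, which is the data locality effect behind the speedups reported above.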
For the streaming part, we set up a 16-node cluster and investigated one configuration knob, Spark's batch interval parameter: for a fixed window of time we accumulate incoming files, then execute them as one mini-batch. We measured processing latency as the time from receiving the first file to finishing the last file in that mini-batch, and we call a deployment feasible when it keeps pace. Given Spark's internal overhead, we find that a batch interval of eight seconds results in a valid deployment, and all longer batch intervals are also valid. The observed latencies are about eight and 16 seconds for the two deployments we report, respectively, and both keep up with a 780-megabyte-per-second data generation rate. If you remember, the throughput requirement was about 330 megabytes per second, so we have a lot of headroom for additional processing, and of course both latency numbers sit well within the LSST 60-second requirement.

So, in conclusion: we demonstrate linear scalability for Kira with both data set size and cluster size. Thanks to data locality optimization, Kira runs about four to five times faster than the equivalent C implementation running on parallel file systems, and it also beats the supercomputer's performance for this particular application. The broader trend is that we can leverage big data platforms such as Spark and HDFS, and whatever comes next, to advance data-intensive, data-driven science. That's basically the conclusion. I think I'm at six minutes. Okay, cool, that's it.
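(Editor's note: as a closing illustration, here is a minimal sketch of the streaming deployment described above, assuming Spark Streaming's built-in file-based input. The eight-second batch interval is the one from the talk; the landing directory and the per-batch action are placeholders, and real FITS input would need a binary record reader rather than a text stream.)

```python
# Minimal sketch: mini-batch processing with Spark Streaming.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="kira-streaming-sketch")
ssc = StreamingContext(sc, batchDuration=8)    # accumulate files for 8 s per mini-batch

# Each interval, Spark picks up files that landed in the directory and runs
# one mini-batch over them; latency spans first file in to last file processed.
incoming = ssc.textFileStream("hdfs:///incoming/")   # placeholder text source
incoming.count().pprint()                            # e.g., report records seen per batch

ssc.start()
ssc.awaitTermination()
```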