Welcome to my talk, and thanks for coming. I want to talk to you about Spark, the work we did to build a system for astronomical data analysis on top of Spark, and some of the cool things about astronomy I learned along the way. I'm the CTO of a company of about 50 people in Zagreb, Croatia. I'm also a computer science PhD student at the University of Zagreb, and I'm collaborating with people at the DiRAC Institute at the University of Washington; they are the astronomers I've been working on this system with. I also wrote the Spark in Action book, which came out in 2016. That's already three years ago, and it was written for Spark 2.0; Spark 3.0 is about to come out now, so it's getting a bit old. I also provide Spark courses, so if you need one for your team, you can ping me.

Okay, so we're talking about the sky. It has fascinated humans for millennia, and it still holds many unanswered questions: How did it all start? What is it made of? What is the future of the universe? Are we alone? We still don't know the answers to many of these questions; we have some ideas, but we really don't know. The way we've been trying to answer them is, of course, by looking at the sky. We've built many different contraptions called telescopes, bigger and bigger ones, some of them really huge, and as you know we've even put some of them into space. They've been discovering really cool things, and this is one of those fascinating things I had never heard of before I started doing this. Does anybody here know what this is? Okay, only a couple of hands, so I'm glad I can tell you about it. This is a strong lensing effect that was predicted by Einstein back in the 1920s; it's called an Einstein ring.
What happens is that there's a supernova explosion somewhere behind a supermassive galaxy, which stands between us and the supernova, right on the straight line between the two. Some of the light that would otherwise just radiate in all directions curves around this supermassive galaxy, because it is so massive it bends light, and some of that light is directed toward the Earth and produces this interesting effect. So that's pretty cool.

There are other cool things too. One of the other things I never knew about is weak lensing: everything we look at, all the images we get from the deep sky, is actually somehow bent, because light always bends around the galaxies along the way. Scientists have been trying to map how light travels around the mass throughout the universe. To do that, they try to correlate how bent the images of neighboring galaxies are, and from those correlations find the clumps of mass. That requires really precise measurements and really good pictures, and that's where LSST comes in. That's the big new telescope being built on Cerro Pachón in Chile. LSST stands for Large Synoptic Survey Telescope, and it's a really huge project. It is both wide and deep. Telescopes are usually either wide, meaning they cover large regions of the sky, or deep, meaning they can capture the very faintest objects in the sky, so it's really rare that one telescope can go both wide and deep at the same time. It will house the largest camera in the world once it's complete. It will observe the night sky for 10 years and produce the first video of the universe in history, because every two and a half nights it will cover the whole sky, and it will keep doing that for 10 years.
That will produce about 700 to 800 images of each region of the sky on average, and it will start full science operations in 2023.

The goals of this project are, first, to answer those questions about dark matter and dark energy. We don't know what those are. We know there is some matter that we cannot see, which is why it's called dark matter: some clusters of galaxies are held together in a way that, if there were no matter other than what we can see, they would likely just fall apart. Something is holding them together, we don't know what it is, so we call it dark matter. There's also dark energy, the presumed force driving the expansion of the universe; the universe has been expanding, right, and we don't know why. That's the first goal. The second is to make a catalogue of solar system objects, including the potentially dangerous asteroids, because the American Congress gave NASA the mandate to find 80% of potentially Earth-threatening asteroids by 2020-something. The third goal is to track transients. When you take two pictures of the same region of the sky, any difference between them, something that moves, disappears, or appears in the second one, is a transient. The LSST pipeline will send alerts to the scientific community within 60 seconds of detecting any such transient, and there can be millions of them during one night. The final goal is to study the structure and formation of our own galaxy.

Just to give you an idea of how expensive this whole project is: this is the building.
It's almost complete now, on top of this hill in Chile. The telescope itself weighs 350 tons. This is an image of the telescope and the mirror. The mirror has a unique design: the primary and tertiary mirrors are ground into the same piece of glass, so the outer part is the primary mirror and the dent in the middle is the tertiary mirror, and the secondary mirror is suspended above it. That's what gives LSST the ability to go both wide and deep. The camera is the largest camera in the world. It will take 3.2-gigapixel images, which translates to 6.4 gigabytes per image, and if you wanted to show just one of those images at full resolution, you would need about 1,500 high-definition TVs, which sounds, you know, impressive. It will take a new image every 20 seconds, every night, for 10 years. In numbers, that is a total of about 80, maybe 90 petabytes, which is not that huge compared to some companies today, but it's still a considerable amount of data, especially for astronomy. It will also have a real-time streaming aspect, with up to 10 million nightly alerts going out to the scientific community, and all of this data will be stored in a queryable catalogue.

So that brings us to the subject of the talk, because each of these telescopes produces a database of measurements of each little point in the sky, whether it's an asteroid, a star, or a galaxy; all of that goes into a big astronomical catalogue. Each row can be a detection, and you have many columns: different kinds of measurements, taken with different filters, and so on. Typically, each astronomical project has its own image processing pipeline, which is really not a trivial piece of software. There are many problems for this software to solve. You have to estimate the background on each image.
That's really important in order to have accurate and precise measurements and to be able to compare with previous measurements. Then there's PSF estimation; the PSF is the point spread function, the image, or blob of light, that just one point source of light makes on the focal plane, on the image you get. It depends on the moment, on the atmosphere, on where in the sky the telescope is pointing, and so on. Then you have to detect all the objects in the image. Sometimes they are not detectable in a single image, you don't see them in one exposure, so you have to co-add several images for them to appear. Then there's, for example, deblending: there are crowded stellar fields where many, many stars sit really close to each other, so you have to separate the light coming from each of them. That's called deblending, and these are all really complicated problems.

Once that's done, you have a catalogue with measurements that are only as good as your processing pipeline, and that's when the astronomers actually start to ask questions. They usually do this through some kind of SQL interface. For example, the SDSS project has an open one; everybody can access it online, issue SQL queries, and obtain some results. You download those locally and analyze them, usually with Python; astronomers mostly know just Python, and if you mention anything else they don't want to talk to you. From that they produce, you know, scientific results. But this traditional approach doesn't really scale well. The LSST project will provide a science platform, and they will use Jupyter notebooks and other tools.
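To make this concrete, here is a toy stand-in for that kind of catalog SQL interface, using an in-memory SQLite database. The schema, column names, and values are all made up for illustration; real catalog schemas (SDSS, LSST) are far richer:

```python
import sqlite3

# Hypothetical miniature catalog table: (object id, right ascension,
# declination, g-band magnitude). All values are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE catalog (obj_id INTEGER, ra REAL, dec REAL, mag_g REAL)")
con.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?)", [
    (1, 10.68, 41.27, 18.2),
    (2, 10.70, 41.30, 21.9),
    (3, 150.00, 2.20, 17.5),
])

# The kind of query an astronomer might issue: bright objects in a sky region.
rows = con.execute("""
    SELECT obj_id, mag_g FROM catalog
    WHERE ra BETWEEN 10.0 AND 11.0
      AND dec BETWEEN 41.0 AND 42.0
      AND mag_g < 20.0
""").fetchall()
print(rows)  # -> [(1, 18.2)]
```

The result set then gets downloaded and analyzed locally, typically in Python; it's that download-and-analyze-locally step that stops scaling once catalogs reach billions of rows.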
However, astronomers often want to combine multiple catalogs, from LSST, the Gaia mission, and other missions, and combine the measurements from all of them, because the surveys have, you know, different approaches, and they can really get new insights when they combine the data from all of them. They also often want to work on the full datasets, to test their theories on billions of objects at the same time, on the whole sky.

That's where AXS comes in, this system we built with the astronomers at the DiRAC Institute at the University of Washington. AXS stands for Astronomy eXtensions for Spark. They saw the need for a new generation of astronomical analysis tools: one that would be efficient at cross-matching, where you need to combine multiple catalogs with billions and billions of stars in each, and do it efficiently and potentially on the fly; one based on industry standards like Apache Spark, so that you don't have to write custom code, maintain it, and solve already-solved problems; one that would provide simple but powerful astronomical API extensions; and one that would be easy to use on premises or in the cloud. The goal is to just spin it up on Amazon or somewhere else, do your analysis on the data that's already there, and then publish the results.

As for AXS's history: before it there was a system with the rather unfortunate name LSD, by Mario Jurić from the DiRAC Institute, where he is a professor of astronomy at the University of Washington. That was, and still is, a tool used for querying, cross-matching, and analysis of positionally and temporally indexed datasets, so it's not just for astronomy, it can be used for other datasets too, but it's mostly used by astronomers. It was inspired by Google's BigTable and MapReduce papers, similarly to Hadoop, and started at about the same time, so it was kind of, you know, ahead of its
time. But today it has a fixed data partitioning scheme that introduces significant data skew, and it also partitions the data by time. However, most astronomers' queries span whole so-called light curves: when you have many measurements of one object at many moments, that's called a light curve, and astronomers most often want to analyze the whole light curve at once, not slices partitioned by time. LSD is also not resilient to worker failures in the way that, for example, Spark is. And it contains lots of custom solutions; as I said, it's not based on industry standards and not backed by a big community. It also needed to be ported to Python 3. When you put all of this together, it made much more sense to build a new system.

First of all, let me explain what this cross-matching thing is. We have Dec and RA coordinates: RA is right ascension, going from 0 to 360 degrees in the direction the telescope rotates, and Dec is declination, going from -90 to +90 degrees. That's the coordinate system. Now we put detections from two catalogs on the same image; for example, we denote one catalog with dots and the other with crosses, and the circles are the search radius we would like to use. We would like to find all the matches from the second catalog that are within some distance of each row from the first catalog. If you have billions and billions of rows in both tables, that's, you know, not a trivial thing. The task is not so complicated to understand, but it's not trivial on big data.

The way we solve this is basically by data partitioning; that's the fundamental thing that makes AXS fast. We based it on the late Jim Gray's zones algorithm.
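To pin down what the task actually is, here is a minimal, naive sketch of a cross-match in pure Python on toy data (my own illustration, not AXS code; real catalogs have billions of rows, which is exactly why this quadratic approach doesn't fly):

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions,
    using the haversine formula on the celestial sphere."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra, d_dec = ra2 - ra1, dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def crossmatch_naive(cat1, cat2, radius_deg):
    """All index pairs (i, j) where cat2[j] lies within radius_deg of
    cat1[i]. O(n*m) comparisons -- what the zone partitioning avoids."""
    return [(i, j)
            for i, (ra1, dec1) in enumerate(cat1)
            for j, (ra2, dec2) in enumerate(cat2)
            if angular_sep_deg(ra1, dec1, ra2, dec2) <= radius_deg]

dots = [(10.0, 20.0), (180.0, -45.0)]        # catalog 1: (ra, dec), made up
crosses = [(10.0003, 20.0002), (90.0, 0.0)]  # catalog 2, made up
print(crossmatch_naive(dots, crosses, 2 / 3600))  # 2-arcsecond radius -> [(0, 0)]
```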
Jim Gray was part of Microsoft Research, and his group came up with an indexing method and a set of SQL queries to make this fast on Microsoft SQL Server. We took that idea but adapted it for distributed architectures. The zones algorithm divides the sky into horizontal zones of a certain height, and that's the basic partitioning scheme for the data.

In our distributed version we store the data in Parquet files, which is arguably the most often used format with Spark, and maybe even in the big data world, and we bucket those files by zone. Buckets are just physical files: when many buckets form a single Parquet table, Spark can read all of those buckets and treat them as a single table. We bucket by zone, meaning that all the rows belonging to the same zone, the same stripe of sky, go into the same physical file. We also sort those files by the zone and RA columns, the zone part being necessary because several zones can go into the same file (we have more zones than files).

And there's one more trick: we take a narrow strip at the lower border of each zone and duplicate all of those rows into the zone below, so that when we search the neighborhood of each point, the matches that happen to fall in the neighboring zone are still found. That way each executor only ever needs the data from a single file, and no data shuffling is required inside the cluster; we just duplicate small parts of the data into different files. This is just an example with four buckets and 16 zones.
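The partitioning scheme just described can be sketched in a few lines of plain Python. This is my own minimal illustration, not the actual AXS code; the zone height, bucket count, and function names are made up:

```python
import math

ZONE_HEIGHT_DEG = 1.0   # hypothetical zone height
NUM_BUCKETS = 4         # hypothetical number of physical files

def zone(dec):
    """Zone index for a declination in [-90, 90): a horizontal sky stripe."""
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT_DEG))

def bucket(z):
    """All rows of one zone land in one bucket; several zones share a bucket."""
    return z % NUM_BUCKETS

def with_border_duplicates(rows, radius_deg):
    """rows: list of (ra, dec). Rows within `radius_deg` of the lower
    border of their zone are duplicated into the zone below, so a
    neighborhood search never needs data from another file."""
    out = []
    for ra, dec in rows:
        z = zone(dec)
        out.append((z, ra, dec))
        lower_border = z * ZONE_HEIGHT_DEG - 90.0
        if z > 0 and dec - lower_border < radius_deg:
            out.append((z - 1, ra, dec))   # duplicate into the zone below
    return sorted(out)   # sorted by (zone, ra), matching the on-disk order
```

For example, `with_border_duplicates([(10.0, 0.0005)], 0.001)` tags the row with its own zone and duplicates it into the zone below, because it sits closer to the zone's lower border than the search radius.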
So this is what I've been talking about: all the objects from zones one, five, and nine go to bucket one, objects from zones two and six go to bucket two, and so on. When we then join two tables partitioned like this, Spark can process each pair of corresponding buckets independently and in parallel. If we have four executors, as shown here, and two catalogs, Parquet tables, each bucketed in the same way with the same number of zones, then each executor can take one corresponding pair of bucket files, read them, and join them independently of the others. If you had many more buckets and a smaller number of executors, the work would be done repeatedly, over several iterations, and take more time; but it's scalable: the more executors you give it, the more it can do in parallel.

I don't know if you're familiar with Spark, but Spark has a web UI, and part of it is the Spark SQL UI, where you can see the logical plan and the optimized physical plan for each of your queries. What this plan says is: there are two branches, one per file, and each just says "scan the Parquet file, filter some of the rows based on some conditions, then sort", the sort being a mandatory step before a sort-merge join. That sort actually does nothing here, because the data is already sorted.
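The bucket-pair join idea can be sketched in plain Python (a toy illustration I made up; in reality Spark does this with bucketed Parquet tables, a sort-merge join, and one task per bucket pair):

```python
from collections import defaultdict

def group_by_bucket(rows, num_buckets):
    """rows: list of (zone, payload). Same zone -> same bucket in any table."""
    buckets = defaultdict(list)
    for z, payload in rows:
        buckets[z % num_buckets].append((z, payload))
    return buckets

def join_bucketed(cat1, cat2, num_buckets):
    """Join two zone-tagged catalogs bucket pair by bucket pair.
    Each pair is independent, so executors can process them in parallel."""
    b1 = group_by_bucket(cat1, num_buckets)
    b2 = group_by_bucket(cat2, num_buckets)
    matches = []
    for b in range(num_buckets):          # one iteration = one executor's task
        for z1, p1 in b1.get(b, []):
            for z2, p2 in b2.get(b, []):
                if z1 == z2:              # equi-join on the zone column
                    matches.append((p1, p2))
    return sorted(matches)

cat_a = [(1, "a1"), (5, "a5"), (2, "a2")]   # (zone, row id), made up
cat_b = [(1, "b1"), (2, "b2"), (9, "b9")]
print(join_bucketed(cat_a, cat_b, 4))  # -> [('a1', 'b1'), ('a2', 'b2')]
```

Because both tables are bucketed the same way, a row only ever needs to meet rows from the matching bucket of the other table, which is what makes the shuffle-free plan possible.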
So we have two files, both read in the same way, and they are brought directly to the join step. If there were a separate box saying "Exchange", that would mean Spark needs to shuffle the data around the cluster. That's something you want to avoid in Spark: you don't want to shuffle data if you don't need to, and in this case you really don't need to. This is the ideal situation you want: fully parallelizable and really fast.

However, there's one more detail, and one more problem with Spark. Consider a query like this one: select from two tables joined by zone, which is an equi-join on the zone column, plus a condition that the RA coordinate falls within some range of the RA coordinate from the second table. What Spark actually does is take all the rows from corresponding zones, and we have many, many objects belonging to the same zone, because zones span the whole sky horizontally, and perform a Cartesian join, each row with each row. That's a huge number of pairs, and only then does it try to
So that's a huge amount of pairs and only then does it try to Filter them out based on the second Condition here this between condition So we implemented this what's actually sometimes called epsilon join or Variation of this epsilon join and we file this spark Jira ticket with this change and we hope that it will be emerging to spark soon So we call it sort merge join in the range optimization But what it does is it just maintains a moving window over the rows from on the on the second table as as the rows from the first table Change so it just checks if the For each row and the second table if it should be Removed from the window or should the next row be added to it and in that way it can it uses the minimum amount of memory required and the minimum amount of Comparisons that are needed so it's really efficient and it just zips by both both tables data from both tables Other approaches that have been used traditionally is hillpicks for partitioning sky into How to partition sky and index the sky Have been hillpicks and the htm hierarchical or triangular mesh. They are both able to dynamically Partition the data on the you can specify the level of granularity that you want On in the runtime on the fly so you can specify, you know, how long how large those Areas you want you want and it's good if you want to describe some complex complex shapes and or Overlapings of those complex shapes however for cross matching and most of the Operations that are actually done for some medical data There we believe at least that it's much slower than zones algorithm. So we have some Performance results and data for in the paper. 
So, this is from our paper. We took the Gaia catalog, from the European telescope that's in space, which has 1.7 billion objects, SDSS, which has 800 million, and also ZTF, with 3 billion objects, and we tried to cross-match those. It was all done on a single large machine with about half a terabyte of memory, but there was still no way all of that data could fit in that memory on a single machine. So we did two variations of the test: with a warm cache and with a cold cache. Cold cache means the cache was empty at the beginning. We didn't use Spark caching, because there's no way we could cache that much data in Spark's memory; we used the Linux filesystem cache. We emptied the cache at the beginning, that's the cold case, and it filled up while the test was running; the second version of the test, the warm cache, started with the cache already partially filled. These are the results: we can cross-match, for example, Gaia, with almost 2 billion objects, against ZTF, with 3 billion, in 40 seconds with a warm cache.
That's only the cross-matching, no additional computation, just the cross-match, and 315 seconds with a cold cache.

Now, the AXS API. You use it just as you would use Spark, and you get some additional methods on these AXS frames. You first initialize the AXS catalog and give it an instance of the Spark session, because the AXS catalog looks at Spark's metastore database, which holds the metadata about the tables Spark knows about, and adds additional metadata about those tables, such as the zone height, the number of buckets, and so on. Then you can use it to load catalogs that you have in your metastore. Here, for example, I'm loading the SDSS catalog and the Gaia catalog, and that gives you a Spark data frame that is also an instance of the AXS frame class, so you have additional methods such as cross-match. Cross-match gives you just another AXS frame, which is again a Spark frame, so you can then do all the additional things you would do in a normal Spark program: filtering, maybe machine learning, or whatnot.

Among the other functionality, we have two versions of the cross-match, and we have region queries, where you give it two points in the sky and it returns all the objects within the rectangle defined by those points, and cone queries, where you give it one point and a radius and it gives you all the objects inside. There are histogram and 2D histogram functions, where you give it a number of bins and it returns the number of objects in each bin based on some condition, or two conditions in the 2D version. You can use array functions if you have light-curve data: many measurements of one single object can go into an array column, and you can then use Spark array functions to analyze it. And of course you can use other Spark functionality, even streaming, machine learning, or GraphX.

The next step is to make AXS cloud-ready, and actually it is already being used this way at the
DiRAC Institute in Seattle: they put some of those catalogs, pre-partitioned, in S3 buckets, and you can spin up AXS on Kubernetes on demand, do your analysis, and bring it down once you're done. We hope to empower new astronomical discoveries in the 21st century.

So, again, if you need a Spark course, you can ping me; there's my email address. You can also check out the documentation and the source code of AXS, it's all open source, if you're interested. And if you have any questions, I'm out here; if not, then thank you.