Good afternoon, everyone. My name is Martin Donikey. I'm a master's student working with Dr Nick Rattenbury at the University of Auckland, Department of Physics, in the area of gravitational microlensing. The particular project I'm working on at the moment is a data mining project using open source software. I'm working with the MOA catalogue, and you've probably heard a bit about MOA today. Just to remind everyone, MOA is Microlensing Observations in Astrophysics, a joint New Zealand-Japan collaboration originally set up to investigate the nature of dark matter.

Just a brief bit about gravitational lensing. As Nick was saying earlier, space-time is curved in the presence of a gravitational field. The effect is that light rays passing close to a massive object are bent, so background objects are distorted, producing multiple images. I've got a couple of little diagrams here. We've got the background object here and the lens here, which in this case is a galaxy cluster. Observing from the Earth here, you can see the effect of the mass of the galaxy cluster on the background object: the light rays are deflected, and you see two separate images, one on either side.

Here's an image from the Hubble Space Telescope showing a real example of this effect. We have the foreground red galaxy here, which is the lensing object, and this blue galaxy in the background. The alignment in this case is almost perfect, so you see a nearly complete Einstein ring; the image forms almost a full circle around the lens.

As for microlensing: in the case of microlensing, the lensing object is typically a star. As we were saying earlier, the separation of the images is related to the mass of the lensing object. In the case of a star, the image separation is quite small, of the order of milliarcseconds, which means we can't actually resolve the two images separately. However, if the lens, in this case a star, is moving across the line of sight from the Earth to the source, then it amplifies the signal.

Here's a little diagram to show the geometry. The rays from the source object are deflected by the lens on their way to the observer, and we see two separate images which, as I was saying, in the case of microlensing are too close together to resolve. However, we can monitor this over time. Here we have our telescope on the ground; in this diagram the lensing object is a black hole, but that's irrelevant, because the only thing that matters is the mass of the object. That comes back to what Nick was saying about one of the nice things about microlensing: the lensing object doesn't need to be emitting any light, yet you can still infer information about it. So if we're monitoring this star from our position on the Earth, before the lens passes across the front we see the normal background star; as the lens comes across the front it amplifies the signal, and we see the star brighten.

Just to show you a light curve, and I think Nick had this one earlier as well: this is some real data, with the different telescopes here providing the data points, so you can see the data are sparse and irregularly sampled. This is the typical light curve you would see for a single lensing object.
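As a minimal sketch of the single-lens light curve just described, the snippet below plots the standard point-source, point-lens magnification; the parameter values are illustrative placeholders, not MOA data.

```python
# Sketch: standard point-source, point-lens microlensing light curve.
# Parameter values are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

t0 = 0.0   # time of closest approach (days)
tE = 14.0  # Einstein-radius crossing time (a couple of weeks, as in the talk)
u0 = 0.2   # impact parameter in units of the Einstein radius

t = np.linspace(-50, 50, 1000)
u = np.sqrt(u0**2 + ((t - t0) / tE)**2)   # lens-source separation over time
A = (u**2 + 2) / (u * np.sqrt(u**2 + 4))  # combined magnification of both images

plt.plot(t, A)
plt.xlabel("time (days)")
plt.ylabel("magnification")
plt.title("Single-lens microlensing light curve")
plt.show()
```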
However, this type of brightening with time is not unique to microlensing. You see it with a lot of other astrophysical phenomena: there are many different types of variable stars, Cepheid variables, RR Lyraes, et cetera, and the eclipsing binaries that Alex was talking about earlier as well. All these different things have been catalogued by MOA alongside the microlensing events, because, as Nick was saying, the MOA telescope uses difference imaging analysis: it basically has a reference image and looks for any change in brightness. The result is that MOA has also catalogued thousands of other astrophysical events.

At the moment this data, going back to the mid-90s, is all just sitting on one of the servers over at Massey, because there's simply too much of it for anyone to manually go through any large subset. The aim of my project is to look at ways to classify these data into distinct types of astrophysical event.

Just for a sense of scale: MOA observes 50 million stars towards the Galactic bulge and the LMC. Gaia, a space-based telescope which started operating just at the end of last year, is observing the positions of about a billion stars across the galaxy, measuring them with the highest precision yet. LSST, the Large Synoptic Survey Telescope, will be coming online within the next decade and will be collecting about 20 terabytes of data every night. And, as Nicholas was just saying about the Square Kilometre Array, there you're talking about terabytes a second. It's an enormous amount of data; this really is the era of big astronomy.

So if we can't manually process all that, how do we do it? The solution is to use machine learning algorithms: the idea being that we use automated classifiers to separate the instances of each of these objects in a multi-dimensional attribute space. To do that, we've decided to use Weka, the Waikato Environment for Knowledge Analysis. That's an open source, Java-based toolbox of machine learning algorithms developed by the University of Waikato in Hamilton. Among the typical problem types Weka can handle is classification, which is obviously what we're concerned with here, but it can also be used for regression or clustering problems.

One of the things that attracted us to Weka is its compatibility: there's RWeka, which is a Weka interface for R, and there's also a Python wrapper, which lets you incorporate Weka there as well. Beyond the compatibility, there's the fact that, as I say, I'm doing a master's this year, so time is restricted and we really wanted something that was ready to go straight away and that we knew would be reliable, and Weka is well supported. The guys down there have been really helpful and have offered to provide us with support, and obviously they're geographically close as well, which is another big advantage. So that was our decision to go with Weka. There are alternatives, the most notable probably being astroML, a Python module built on NumPy, SciPy and Matplotlib, which is another good option.
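As a rough illustration of the Python route into Weka mentioned a moment ago, the sketch below assumes the python-weka-wrapper package; "lightcurves.arff" is a hypothetical feature file, one row per light curve with the class label last.

```python
# Sketch of driving Weka from Python, assuming the python-weka-wrapper
# package; "lightcurves.arff" is a hypothetical feature file.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random

jvm.start()  # Weka runs inside a JVM

# Load a feature table (one row per light curve, class label last)
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("lightcurves.arff")
data.class_is_last()

# Weka's random forest implementation
rf = Classifier(classname="weka.classifiers.trees.RandomForest")
rf.build_classifier(data)

# 10-fold cross-validation to estimate accuracy
evaluation = Evaluation(data)
evaluation.crossvalidate_model(rf, data, 10, Random(1))
print(evaluation.summary())

jvm.stop()
```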
But as I say, it was really the fact that Weka is well supported by the guys down there, and we knew we could rely on them if we needed help getting this classifier running.

As for the classifier itself, we've chosen to go with the random forest classifier. That's supervised, which basically just means you're using a training set, you're not running it blind. The main features that attracted us to it are its accuracy, the fact that it's computationally fast, and that it's really robust to outliers and irrelevant features; I'll talk a little bit more about that in a second.

So how does it work? Basically it's a tree-based classifier. At each node of a tree, the classifier takes a random subset of the features (you might have 10, 20, 30 different features or attributes that characterise your data) and chooses the best attribute from that random subset to split that node of the tree on. You go down to the next node of the tree and it continues until you get to a node size of one. Once you've got one tree, you repeat the process until you have multiple trees, and you might make hundreds or thousands of trees, effectively a forest, hence the name. As for how you estimate the probabilities: once you have your forest built, you take the data you want to classify and run it down all these thousands of trees, and effectively the trees vote for the most likely class.

This is just a little illustration. Here would be one tree, that's another tree, all the way up here. At each node in a tree a different attribute has been chosen to split the data. You take your data point, your instance of the class, and run it down: this tree votes for this class, this tree votes for another class, that tree votes for yet another, and then you average the results over all the trees, and that gives you the most likely classification for that instance of the data. This second illustration shows the same thing: the trees are all slightly different, each produces its own probabilities, and averaging them, which is basically just the probability of a class given your input data, gives you the most likely classification for that particular instance.
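The per-node feature subsampling and tree voting just described can be sketched with scikit-learn's random forest; this is purely an illustration of the mechanism (the project itself uses Weka's implementation), and the data here are synthetic placeholders.

```python
# Illustration of random-forest voting with scikit-learn; the project
# uses Weka's implementation, and these data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_features = 500, 10

# Fake feature vectors for three pretend classes of light curve
X = rng.normal(size=(n, n_features))
y = rng.integers(0, 3, size=n)
X[y == 1, 0] += 2.0   # give each class a weak signature
X[y == 2, 1] -= 2.0

forest = RandomForestClassifier(
    n_estimators=500,     # hundreds of trees, "effectively a forest"
    max_features="sqrt",  # random subset of features tried at each node
    min_samples_leaf=1,   # grow until node size one, as described
).fit(X, y)

# Each tree votes; predict_proba averages the votes over all trees
probs = forest.predict_proba(X[:1])
print("class probabilities:", probs[0])
print("most likely class:", forest.predict(X[:1])[0])
```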
So how do we apply this in our case? First of all we need to build the training set. For MOA we're actually using synthetic light curves. In an ideal world this data would all have been previously labelled through third-party methods, people going through it manually, so we'd have a stack of all the different types of events and could have used those to train our classifier. However, as Nick was saying, due to manpower issues really, that's not been done with the MOA database. It has been done with OGLE, though, which is another microlensing collaboration, a joint effort between Polish astronomers and the US. All their data is free and open; it's online, the whole catalogue is up there. So basically we're taking the light curves they've previously classified into all these different types of events and reconstructing them, and then adding noise from our database. We need to add our own noise because every telescope's instrumentation noise is different depending on the setup. So we add noise to these template light curves, and after that we train the classifier.

Training the classifier basically means extracting all the different features with which you're trying to characterise your light curve. For the period analysis we're using the conditional entropy method that Alex was talking about earlier. We want to take each of those light curves and pre-process it so we can extract all these features: things like the period, harmonics fitted to the curve, skewness and kurtosis, and so on. There's a huge array of different features you can use to try to best classify a light curve, and a big part of the project, once we're training this classifier, is going to be finding the best selection of features to characterise the data and balancing that against computation time, et cetera. The main cost of the random forest method is building the actual classifier; once you've got your forest built, it's comparatively trivial to take your data and run it down all these trees, so the big job is training the classifier. All the pre-processing is being done in Python, NumPy, SciPy, it's all done in that, and, as I say, using the conditional entropy method Alex was talking about earlier; there's a rough sketch of this step just below.
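Here is the rough sketch of that feature-extraction step: skewness and kurtosis via SciPy, plus a bare-bones period scan in the spirit of the conditional entropy method mentioned above. The bin counts and the trial-period grid are illustrative assumptions, not the project's actual settings.

```python
# Sketch of light-curve feature extraction: skewness/kurtosis plus a
# bare-bones conditional-entropy period search. Bin counts and the
# trial-period grid are illustrative assumptions.
import numpy as np
from scipy.stats import skew, kurtosis

def conditional_entropy(t, mag, period, phase_bins=10, mag_bins=10):
    """H(mag | phase) of the light curve folded at a trial period."""
    phase = (t / period) % 1.0
    m = (mag - mag.min()) / np.ptp(mag)   # normalise magnitudes to [0, 1]
    p, _, _ = np.histogram2d(phase, m, bins=(phase_bins, mag_bins))
    p /= p.sum()                          # joint distribution p(phase, mag)
    p_phase = np.broadcast_to(p.sum(axis=1, keepdims=True), p.shape)
    nz = p > 0                            # sum only over occupied bins
    return np.sum(p[nz] * np.log(p_phase[nz] / p[nz]))

def extract_features(t, mag):
    """A few of the features mentioned in the talk, for one light curve."""
    trial_periods = np.linspace(0.1, 100.0, 5000)    # illustrative grid (days)
    ce = [conditional_entropy(t, mag, P) for P in trial_periods]
    best_period = trial_periods[int(np.argmin(ce))]  # minimum entropy wins
    return {
        "period": best_period,
        "skewness": skew(mag),
        "kurtosis": kurtosis(mag),
        "amplitude": np.ptp(mag),
    }
```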
Once we've got our classifier all trained, it's then a case of running our data through and seeing what pops out, and that's going to feed into the likes of Alex's work on eclipsing binaries. Looking to the future, we would hope to put all this online, similar to what OGLE have done, so that this whole catalogue would be free and open access. If you're interested in a particular area, say eclipsing binaries, you'd just look up the catalogue and it's right there, whereas at the moment Alex, who is interested in eclipsing binaries, unfortunately has to struggle through the data himself.

Sorry, just a couple of other considerations. There are a couple of groups that have been looking at this sort of thing with a view to the future LSST and Gaia and the like, and at ways to improve the performance of the classifier. One is active learning: for the instances where the classifier is not going to be very accurate, instead of forcing them into the different categories, you basically hold them back for a human to go through manually, identify by eye, and classify. Some of the groups have found that this combination of the automated aspects with a human manually analysing the noisy and uncertain stuff can improve the performance considerably. There are also the options of noisification or denoisification; in the latter you basically try to take the noise out of your data and match it to an idealised light curve, which is kind of the reverse of what we started with using the OGLE templates. So those are a couple of other considerations that could possibly improve things, but that's for the future.

Just to briefly mention the long-term goal of this type of work: what I'm doing at the moment is kind of laying the foundation, but the ultimate aim would be to get a real-time classifier running down on the MOA telescope. John was talking about supernovae earlier; basically you want your classifier running in real time as these events are happening. The typical microlensing event timescale is a couple of weeks, so you want your classifier running so that as you get more data points you're continually narrowing down what the event is, because if it's something interesting, something rare like a supernova, then you want to get other telescopes on it and alert the community. Again, that's for the future. Happy to take any questions.

Thanks Martin, that was great. Do we have any questions for Martin? Oh, we have lots of questions, excellent.

First, congratulations on your presentation. How much power would your algorithm consume for machine learning?

It's really just the training of the classifier, so that's a good question. We've got the NeSI cluster at the university to use for that, but the cost is really something to be determined by what features we use, how many we use, and how many bins we want: do we want to just categorise events into four or five different types, or subdivide further into 10 or 20? That will really determine it.

And you're happy with NeSI?

Yes, we're happy to use that option.

One of the things that I've seen as a problem in rule-based or machine learning systems is that the learned rules aren't really equatable to real-world axioms that we might use to explain the data ourselves. Do you see a way of using the machine learning system to actually tell us some general rules about the data sets you're working with?

Do you mean between different types of event?

I mean general observations that can help us predict not just how that particular system works, but that may actually give us insights into other systems, other properties of stars, that this particular system doesn't look at.

That would be kind of the point. As Nick was saying earlier, we can say this is a variable star, this is microlensing, and this is something different, we've never seen it before. And I guess that's where the science would come in: if this is an event that's not been seen before, how can we use our current theories to best understand what it has shown us?

Do we have some more questions? If you have questions, keep your hands up and we'll get to you; we do have a little bit of time, so we're happy to take some more. Thanks Martin very much.