Dr Rahul Ramachandran is deputy editor for Earth Science Informatics and a principal research scientist at the Information Technology and Systems Center at the University of Alabama in Huntsville. Rahul has pioneered the concept of data prospecting, which sits between data mining and data discovery. He is also chair of the newly formed Research Data Alliance working group on big data analytics. He graduated with a PhD in 2002 from the Atmospheric Science Department at the University of Alabama in Huntsville and was selected for the 2010 Presidential Early Career Award for Scientists and Engineers. In this presentation, he speaks about his work on data discovery and big, complex data in the domain of atmospheric science.

Thank you for inviting me to give this talk. Here's the outline. I'll go through a few slides to introduce who I am and what I do in terms of research, and talk a little about the research lab where I work. The lab actually runs a data archive with, I think, very similar demands to yours, so there may be things of interest there. Then I'll talk about two ongoing data projects that might be of interest to people here. The first one looks at data and information aggregation; the other is a big data project looking at event analytics.

I have a rather eclectic educational background that's kind of all over the place. I guess I'm still an engineer; that's what I did as an undergrad, but I have degrees in atmospheric science and computer science.

A little bit about my center. We are a research lab with a research focus. We've done a lot of work in data mining, analysis, and discovery, and obviously work in the area of informatics. We do a fair amount of other work too; we've built games that are used by the US military for training, so a lot of our funding comes from the three-letter agencies in the US. We also run one of the 12 NASA data archives. It's a fully operational data archive: we do operational data ingest, processing, archiving, and distribution. Our primary data sets are lightning data. We also hold a lot of the ground validation field campaign data that NASA collects for instrument validation, and then we have the microwave data products, which are an important part of the science data holdings.

You all know this: this is the data life cycle. What has changed in recent years is that data is no longer treated as a second-class citizen; it's treated as a first-class object. The importance of data is now slowly being realized; there's a quote that data is now the currency of science. What we try to do is figure out what processes and inefficiencies exist in the data life cycle, and how we can make the scientific process faster, better, and more productive. As technology evolves, some of the process components may become obsolete, so we have to start looking at new solutions. The other thing that's happening in the US, and it may be happening here too, is new policy requirements that you have to deal with. NSF now requires a data management plan. NASA has a data preservation requirement: you have to handle all the metadata details so that you have enough contextual understanding, years down the road, to reuse the data. And the other thing coming over the horizon is the whole notion of reproducibility and executable papers.
That's becoming the gold standard: with science, especially the aspects of science that have major policy implications, you have to make sure that you can reproduce those results.

The area I work in is Earth science informatics, which is basically looking at how you apply systematic technology approaches to the entire data life cycle: not just the knowledge extraction and the decisions that come out of it, but also the acquisition and processing, how you gather information, all of that. The important thing is providing customized solutions to the stakeholders, not giving them tools that they cannot utilize.

Now I'm going to transition. That was pretty much a background on what I do, and I'm going to talk about two ongoing projects. The presentation of these projects is at a fairly high level.

This is a slide that's required now, to give a definition of big data. The first one is Gartner's definition that everybody knows, with the notions of volume, velocity, and variety: velocity, meaning there are real-time aspects to the data, and variety, meaning there are different kinds of data with different kinds of quality information and format types that you have to handle. I really don't like Gartner's definition of big data; that's a different perspective, I guess. This one is from Jim Frew at the University of California, Santa Barbara, and I like his definition better; I think it's more of a data-centered perspective. You can't really move big data. It's like a pipe organ: if you want to play, you have to go to the organ; the organ doesn't come to you. That's how it is with big data: you can't move it, so if you want to use it, you have to go where it is. The implication for data centers around the world is that you now have to start looking at systems that can actually do the analysis where the data lives.

The two projects I'm presenting here map onto those ideas. The first focuses on the notion of variety: there is now so much information and distributed data that you can get on the web, from different locations and different sources; how can you automate aggregation around events of interest? The second looks more at analytics, specifically event analytics.

So here's the first project. This is a NASA-funded project called Curated Data Albums for Science Case Studies. The concept is that a data album is a compiled collection of information around an event of interest. This compiled collection includes not just the data files that you want to use for studying that event, but also links to services, tools, news reports, videos, anything that gives you the full contextual understanding that's useful for studying that event. The curation part allows an end user to customize the data album for their particular study, because each user may have a different view of what they want to study that event for. A minimal sketch of this structure follows below.
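To make the data album concept concrete, here is a minimal sketch in Python of how such a compiled collection might be represented. The class names, fields, and the `curate` method are my own illustration of the idea, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    """A single item aggregated into a data album (hypothetical fields)."""
    title: str
    url: str
    kind: str         # e.g. "dataset", "news", "video", "report"
    source: str       # e.g. "NASA DAAC", "YouTube", "Wikipedia"
    relevancy: float  # score assigned by the ranking service

@dataclass
class DataAlbum:
    """A curated collection of information around one event of interest."""
    event_name: str   # e.g. "Hurricane Sandy"
    start_date: str
    end_date: str
    resources: List[Resource] = field(default_factory=list)

    def curate(self, threshold: float) -> List[Resource]:
        # An end user customizes the album by keeping only the
        # resources above their chosen relevancy threshold.
        return [r for r in self.resources if r.relevancy >= threshold]
```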
The motivation behind building a tool like this is that, in atmospheric science, one of the most common kinds of research is case study analysis: studies focused on a significant event, like a major flooding event or a major hurricane coming through. A lot of research is done on understanding how that event occurred. To do that, you need a wide variety of data and information from all the distributed locations out there. For example, NASA has all these different DAACs, and each DAAC holds one kind of data set. If you are an individual researcher, you have to figure out where to go and get the data based on the metadata that's provided. The other thing is that science is becoming very interdisciplinary. You may have users who are not experts in a particular data set, or who don't know the exact vocabulary or metadata term to use to get the search right. How do you support users like that? The whole gathering of data and information around this notion of events is very tedious and time consuming, so the challenge is to build a tool that can do this gathering in an automated manner. But the gathering part is actually the easy part. The hard part is figuring out what is relevant among everything that's out there: once you're gathering stuff, figuring out what to filter and what to keep. The other part of it is that metadata tends to be fairly boring. How do you present this collated information in a manner that is more useful and intuitive? We could present metadata in a really dry manner, so can we do something a little different here? That is the challenge in building this tool.

The science driver here is hurricane science. It's probably the easiest event type to start with, because a hurricane is a major event: there's lots of information about it, including information about tracks and how the hurricane actually progresses. The goal was to use this as the first science driver and build data albums for all the hurricane events of the last 20 years. The focus is not just on the data but also on the information that is normally required: the background information, what damage was caused, how many deaths there were. And all of these things require parsing through web pages or PDF files of storm reports.

This is the conceptual architecture. You have these different resources on the left-hand side, coming from all these agencies in the US, and you may have crowdsourced things like videos and pictures from YouTube. The goal is to aggregate all of this, figure out what is relevant, put it in a structured form, and present it to an end user so that they can actually utilize the information.

This is the software architecture for the tool. There is an engine that drives the different brokers that talk to the different distributed resources and then puts everything together in a NoSQL database, because some of the data gets so large that efficiently querying it becomes an issue. There is a service layer, and at the top we have a presentation layer where we allow the user to do some interactive analytics; there is interactive visualization and a faceted visual search. The piece I'm going to talk about, the piece that's new, is the ontology-based relevancy ranking capability, and then obviously I'll show you a little bit of the tool. A sketch of the engine-and-broker pattern follows below.
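Since the engine-and-brokers pattern is the backbone of the architecture, here is a minimal sketch in Python of how that might look. The class names (`Broker`, `WikipediaBroker`, `Engine`) and the record format are illustrative assumptions, not the project's actual code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class Broker(ABC):
    """One broker per distributed resource; the engine drives them all."""

    @abstractmethod
    def fetch(self, event_name: str) -> List[Dict]:
        """Return raw records about the event from this resource."""

class WikipediaBroker(Broker):
    def fetch(self, event_name: str) -> List[Dict]:
        # Placeholder: a real broker would call the resource's API here.
        return [{"source": "Wikipedia", "title": event_name, "text": "..."}]

class Engine:
    """Drives the brokers and aggregates their results, which would then
    go through relevancy ranking and into the NoSQL store."""

    def __init__(self, brokers: List[Broker]):
        self.brokers = brokers

    def aggregate(self, event_name: str) -> List[Dict]:
        records: List[Dict] = []
        for broker in self.brokers:
            records.extend(broker.fetch(event_name))
        return records
```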
The ontology-based ranking service is designed as a general service that can be customized to many different applications. It combines an ontology-based score and a traditional statistical score, following two published approaches.

I won't go into the details of the algorithm, but there are two components. The first is an ontology component, where we have an application ontology; in this case, an ontology for hurricanes. For all the concepts in the ontology we calculate weights based on the linkages between the concepts: the more connected a concept is, the higher its weight. Then we calculate an activation value. Not all the concepts in the ontology are equally important; certain ones are of much higher value, and those are the key ones where the search starts, so those concepts get much higher activation values compared to the others. Then we use the very standard TF-IDF model for the statistical calculations: we do a term frequency calculation for a word and an inverse document frequency calculation for that word. To compute the relevancy score, we take a document, look at all the metadata, match it against the concepts in the ontology, calculate a score based on the ontology, calculate a score based on the TF-IDF model, and combine them into a relevancy score. That's how we can do relevancy filtering on the information we are gathering.

Question from Andrew Charo: I can see how it's going to work for documents, but I can't see how it's going to work for some of the non-document resources that you're showing in the bottom layer of your conceptual architecture. Like what? Stuff that's on YouTube, for instance. What we do for YouTube is query expansion; the ontology provides the query expansion. If you're searching for, say, Hurricane Sandy, and we just search for "Hurricane Sandy" on YouTube, you may get all sorts of irrelevant results from people who mention Sandy. So we can use the ontology to automate the query expansion: we add more terms so that the query itself does the relevancy filtering.

We looked at how well this algorithm works. Clearly the algorithm can be improved quite a bit; that is part of the remaining work, and there are known things we can improve, not just in terms of the algorithm itself. We compared the algorithm against truth data, namely our data center's collections for hurricanes. We manually selected 35 data collections and compared them against the top 35 returned by the algorithm. We get an accuracy of about 80%, and the precision and recall are about 60%. Ideally you want both high precision and high recall, but normally that never happens, so for search purposes we have tuned it for high recall at the cost of lower precision. The goal here is to make sure that everything that is important is part of the search results. A sketch of the combined scoring follows below.
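Here is a minimal sketch, in Python, of the kind of combined scoring just described. The toy ontology, the degree-based weighting, the activation values, the additive combination, and the `expand_query` helper are all my own simplified assumptions about the approach, not the project's actual implementation.

```python
import math
from typing import Dict, List, Set

# Toy application ontology: concept -> linked concepts.
# The more connected a concept, the higher its weight.
ONTOLOGY: Dict[str, Set[str]] = {
    "hurricane": {"storm surge", "wind speed", "track", "landfall"},
    "storm surge": {"hurricane", "flooding"},
    "wind speed": {"hurricane"},
    "track": {"hurricane", "landfall"},
    "landfall": {"hurricane", "track"},
    "flooding": {"storm surge"},
}

# Key concepts where the search starts get a higher activation value.
ACTIVATION = {"hurricane": 1.0}
DEFAULT_ACTIVATION = 0.5

def concept_weight(concept: str) -> float:
    # Weight a concept by how connected it is, relative to the most
    # connected concept in the ontology.
    max_degree = max(len(links) for links in ONTOLOGY.values())
    return len(ONTOLOGY[concept]) / max_degree

def ontology_score(doc_terms: List[str]) -> float:
    # Match document text against ontology concepts; each matched
    # concept contributes weight * activation.
    text = " ".join(doc_terms)
    return sum(
        concept_weight(c) * ACTIVATION.get(c, DEFAULT_ACTIVATION)
        for c in ONTOLOGY if c in text
    )

def tfidf_score(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    # Standard term frequency * inverse document frequency.
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / (1 + df))

def relevancy(doc: List[str], query: str, corpus: List[List[str]]) -> float:
    # Combining the two components with a plain sum is an assumption here.
    return ontology_score(doc) + tfidf_score(query, doc, corpus)

def expand_query(query: str) -> List[str]:
    """Ontology-driven query expansion, e.g. for YouTube searches."""
    return [query] + sorted(ONTOLOGY.get(query, set()))
```

For example, `expand_query("hurricane")` would add related concepts such as "track" and "landfall" to the search terms, so that a bare name like "Sandy" is less likely to pull in unrelated videos.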
So maybe I should demo this. This page shows the information that's been aggregated for the different hurricanes. You have three different views. We still have people who like things in tabular form, so they insisted on having this tabular view. There's a different, bubble view: for a particular year you can see the different storms; the color is the category of the storm, and the size is the amount of information that's been collected for that particular storm. The sunburst view shows the same thing, but now you have a way of doing a visual faceted search: you can drill down to a particular year and see the storms by category, and here the angular size is the storm's duration.

And if you select a particular storm, you see all the aggregated information that we are getting. This part is from Wikipedia; all of this tabular information comes from parsing PDF reports. These are things that are important for the users, like how well they did on the forecast for this particular storm versus the record. It's fairly easy to add brokers for particular kinds of reports, and the PDF parsing is rule-based; a small sketch of that idea follows at the end. Then, for the particular storm itself, you have the actual data sets. These are the different data collections, and you can get a list of all the granules on the server that you can use for studying the particular storm. You can search based on keywords or instruments, and the user can change the relevancy threshold: if they think the threshold is too high or too low and they're not getting enough results, they can change it. Oops, sorry about that; somehow it's not refreshing on the connection I'm using. If you select an individual data collection, then you get only those files that you can utilize for your study. So this is kind of a different way of gathering distributed information.
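As an illustration of the rule-based parsing mentioned above, here is a minimal sketch in Python that pulls a couple of fields out of the text of a storm report. The regular expressions, field names, and sample text are assumptions for illustration, not the project's actual rules or data.

```python
import re
from typing import Dict, Optional

# Each rule maps a field name to a regular expression applied to the
# plain text extracted from a PDF storm report (the text extraction
# itself would be done with a PDF library beforehand).
RULES: Dict[str, str] = {
    "deaths": r"(\d+)\s+(?:direct\s+)?deaths",
    "max_wind_kt": r"maximum\s+(?:sustained\s+)?winds?\s+of\s+(\d+)\s*(?:kt|knots)",
    "damage_usd": r"damage\s+(?:estimated\s+)?at\s+\$?([\d.]+)\s*(billion|million)",
}

def parse_report(text: str) -> Dict[str, Optional[str]]:
    """Apply each rule to the report text; unmatched fields stay None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in RULES.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields

# Illustrative made-up snippet, not text from a real report.
sample = "Maximum sustained winds of 100 kt were observed ... 72 direct deaths ..."
print(parse_report(sample))  # {'deaths': '72', 'max_wind_kt': '100', 'damage_usd': None}
```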