Welcome to this panel on data modeling and analytics. We've been talking about various aspects of big data for finance during this conference. This panel is about some of the prerequisites, things that must be addressed to be able to do what we need to do: cleaning and transformation of data, integration of data from multiple sources, data modeling, analytics. Getting all of this done from a multiplicity of sources and at volume is a challenge. And a point that I wish to underline is that it is a technical challenge, one that requires advances in our technology and that is addressed through computer science research. So at least part of what this panel will talk about is what the needs are, how we are meeting them, and where we stand today. What I want to do is talk through one example just to set the stage. Most of you in this room are probably familiar with legal entity identifiers. The LEI is a negotiated standard and a significant step forward in the ability to identify participants in financial transactions across data sets. But while the LEI is a major step forward, and one accomplished through negotiation, it doesn't get us home all the way. The reason is that we have legacy data which may use other identifiers, and we still have to deal with non-compliant data sets. There are complex ownership structures of firms, so just looking at LEIs often doesn't tell us the whole story. We have changes due to mergers and acquisitions that we need to take into account. So we have a lot of mess that we still have to deal with, even in a world where there are perfect LEIs. And this sort of two-part solution is a very common situation when one is bringing data together. There are things you can do by setting up the right systems, by having the right governance, by negotiating the right standards, and so on. The more you can do there, the better off you are. But what you can do there is rarely enough by itself. Then you use technology, in specific situations and specific applications, to get you the rest of the way from however far the process has taken you. So there's a process piece and a technology piece. For example, on top of the LEI, there is a whole series of challenges on financial entity identification and information integration being run through the National Institute of Standards and Technology, NIST, and I've got the URL there. This series of challenges is precisely meant to address the gap in what the LEI doesn't fully accomplish, and to get multiple parties thinking about the technical issues that need to be resolved to achieve that promise. Anyway, as moderator, I don't want to be giving a big speech. This is just a setup for the kinds of issues that I hope our panel will be talking about. And to do that, we have a really wonderful, distinguished panel. We have two economists, finance-oriented people with deep technical understanding from the economics and finance perspective. And then we have two computer scientists who have been looking at issues related to big data for applications such as finance. So we have a well-balanced panel.
In terms of the order in which we're going to do things, we're going to begin and end with the economics and finance piece and sandwich the technical piece in between. That just seemed like the best flow. So the order of the panelists is what you see on the screen. Our first speaker is Lewis Alexander, who's a managing director and the U.S. chief economist for Nomura. He has spoken and written widely on data issues related to many financial matters. Our second speaker will be Zach Ives, who's a professor of Computer and Information Science at the University of Pennsylvania. He's the director of the Singh Program on Networked and Social Systems Engineering, and he is a renowned expert in data integration. Our third speaker is Claire Monteleoni, who is an assistant professor of computer science at George Washington University and has done some really impactful work in the analytics of financial data. And our final panelist is Sanjiv Das, who's the William and Janice Terry Professor of Finance and Business Analytics at Santa Clara University and has done a number of very innovative things with exploiting cutting-edge data analysis technologies to address questions of interest in finance. So with that, I'd like to call on our first panelist, Lewis Alexander.

Thanks very much. A few caveats are in order before I start. When Michael and Dick invited me to this, and thanks very much for that, I told them I'm an enthusiastic amateur in this world, and they invited me nonetheless. By way of introduction: I'm an economist, and I want to stress I'm a macroeconomist. I've gotten involved in this stuff because I happened to be at Treasury when Dodd-Frank was passed, got involved in the early efforts to set up the OFR, and have stayed in touch through the Treasury's Financial Research Advisory Committee, which has been a lot of fun. I wanted to spend most of my efforts in this part of my career on the data issues because I actually think they're both incredibly important and the place where we can make a tremendous difference. So I'm going to talk very generally about how I see this for a few minutes and then turn it over to people who know more about this than I do. The first thing I want to talk a little bit about is why this is such a problem. The thing I think people don't always appreciate is the degree to which the financial sector is really driven by IT. If you think about industries broadly, there are few that are as dependent on information technology as finance is. Finance has been transformed by Moore's law, and I think it is very important to start with that. Now, in the data world you see that in the sheer volume of data being generated by the system every day. So part of the problem we're all talking about is just the sheer magnitude of the stuff being produced. There's a second, related problem, which is that systemic financial crises, the ones we really care about, are actually quite rare. In the United States we had one in the fall of 1907, one in the spring of 1933, and then one in the fall of 2008. Now, if you think about the analytic problem of how you prevent these things, if you're operating on events that are inherently rare and you're talking about an industry that is changing as rapidly as finance is, you've got a problem.
I was very happy to hear that the Bank of England is now releasing weekly balance sheet data going back to the 1840s. One of the ways you have to respond to the problem of financial crises being rare is to look across countries and to look back in time. I actually believe very strongly that it's very important to have an economic history component to this, but one of the challenges is pulling together the data that is available. So I'm a big fan of that, and I'm going to be looking in my email for that data set; I've actually looked at the UK problem, so that's great. The other thing I want to stress is that the data problem we're really talking about exists both within financial firms and within the system. When you think about what a big financial firm is these days, it is something that generates a tremendous amount of data, but generates it in a very disaggregated way. So the problem for the people who are managing a large financial firm is a data aggregation problem. If you're a risk manager or a senior business manager of one of these firms, the data you're having to look at is produced in silos at very high levels of disaggregation. That information has got to be aggregated up in a way that's meaningful, so that you as a risk manager or a manager sitting at the top of the firm can actually make sensible decisions. That is a very big problem. And that problem is very close and very similar to the problem faced by the people who worry about the system. What is their problem? They get information not just from individual firms but across all these markets. In fact, these two problems, how a firm manages its internal data problem and how the people who worry about the system confront theirs, are really closely related. Essentially the core issue is: how do you pull together data that comes from very disparate sources in a way that allows for meaningful comparison and aggregation? In essence, that is the core problem we face. Now, what's the answer to that problem? I would argue that the answer is fundamentally standards. So let me talk about that a little bit and about where we are. One of the things I want to stress is that I'm in some ways optimistic about this. The reason I'm optimistic is partly that the financial firms themselves face this problem, and we're living in a world where lots of capacity for managing information is being developed, so we can take those solutions and apply them more broadly. In particular, what I think is interesting is the set of developments going on in the broader information world that go under headings like the semantic web, semantic processing, linked data, Web 3.0. Broadly, that is about taking data from different sources and being able to pull it together in a meaningful way. Now, the crucial thing you need to do that is essentially a set of definitions of the key concepts relevant to whatever problem you're dealing with, definitions that are robust and that capture the relationships among the key concepts, so that you can take data from different sources, link it through these robust standards, and pull it together. And this is a broad movement going on in the data world more generally.
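To make that "link through shared definitions" idea concrete, here is a minimal sketch in Python using rdflib. The namespaces, property names (hasExposure, sameEntityAs), and the LEI-style identifier are hypothetical stand-ins, not actual FIBO or GLEIF terms; the point is only that once two sources are linked through a common vocabulary, a single query can aggregate across them.

```python
# A minimal sketch of linking disparate sources through shared definitions.
# All URIs and property names below are hypothetical, not real FIBO terms.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/finance#")   # hypothetical ontology
LEI = Namespace("http://example.org/lei/")      # hypothetical LEI URI scheme

g = Graph()
g.bind("ex", EX)

# Source A: a trading system that already keys counterparties by (made-up) LEI.
g.add((LEI["5493001KQW6DM7KEDR62"], EX.hasExposure, Literal(125.0)))

# Source B: a legacy system with its own ID, plus a mapping to the LEI.
legacy = URIRef("http://example.org/legacy/cpty-0042")
g.add((legacy, EX.hasExposure, Literal(40.0)))
g.add((legacy, EX.sameEntityAs, LEI["5493001KQW6DM7KEDR62"]))

# One query that follows the mapping and aggregates across both sources.
q = """
SELECT (SUM(?amt) AS ?total) WHERE {
  { ?e ex:hasExposure ?amt .
    FILTER NOT EXISTS { ?e ex:sameEntityAs ?other } }
  UNION
  { ?alias ex:sameEntityAs ?e ; ex:hasExposure ?amt . }
}
"""
for row in g.query(q, initNs={"ex": EX}):
    print("total exposure:", row.total)  # 165.0
```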
Well, that is essentially exactly the problem that we have in finance. Now, the good news is that these standards are being built. There is something called the Financial Industry Business Ontology, FIBO, which is a set of standards being built by an industry group of data geeks. They've been working on it for over a decade, but it now largely exists as a set of robust definitions with all the various relationships, and we're starting to see it applied in the real world. One of the things that was mentioned earlier was the BCBS 239 standard. This is the standard that basically goes to the big firms and says: you have to solve your internal data problem. One of the things that's going on is that a lot of the big firms are actually using FIBO as the basis for their internal data systems. Going beyond that, we recently had what I consider one of the first real successful tests of this in a systemic context. State Street Global Advisors took a bunch of its internal derivatives data, aligned it to FIBO, and then used semantic processing tools to ask a bunch of basic queries. The reason I think this is so important is that one of the things that came out of Dodd-Frank, and of the regulatory initiative broadly, has been this idea that we need swap data repositories; we need to bring all this swap data into one place. The agencies that have done that have had a tremendously difficult time making sense of the information they've gotten, because it's all coming in in a bunch of inconsistent forms. What this State Street initiative has essentially demonstrated is that if you take all this data that comes from different sources and link it through a common set of definitions, you can then ask sensible questions of the data. And I'm in fact quite optimistic that we can make progress here. Look, I'm not naive enough to suggest that this is in any sense easy, but I actually think we're at a moment where there are a lot of moving pieces that give me hope that we're going to make real progress. The regulatory side is going to the big financial institutions and saying: you must solve your internal problem. And we're starting to see utilities being developed that allow this data to be managed more effectively across institutions. So I'm actually kind of optimistic, and I come back to what is perhaps a naive view, but I look at this problem and think: it seems a little weird to me that in a world where our capacity for producing, manipulating, and managing data is growing exponentially, and where the cost of doing that is essentially going to zero, we still have this problem. Ultimately it's institutional. I think it is about governance. If you ask why the firms did not solve these problems internally, it's all about how the firms are governed internally. And the global version of this problem is really about us just going and saying: look guys, we've got to fix this. None of this is to minimize how difficult those governance and institutional problems are, but let's recognize that that's what the problem is. I'm going to stop there. Thank you.

Great. So I'm going to talk also about data integration, from the perspective of having tried to build integration platforms for a variety of domains. And I'd like to in some parts reinforce and in some parts disagree with Dr. Alexander's points.
So I think the vision overall is a pretty clear one. Somehow or other, what we want to do is pull together all the internal databases and Excel spreadsheets and documents and so on from across our organizations. But in the long run you also have to think about bringing in data from the open web and from all sorts of other auxiliary places. And the vision is that somehow we're going to build tools, standards, terminologies, et cetera, to clean up and curate that data and turn it into something that looks very much like a clean global database, so that we can run analytics and perhaps notify on events and so on. It's a really compelling vision. The challenge is that the step in between, from all the data sources at the bottom up to that clean layer, introduces a huge number of problems. Definitely progress can be made on a lot of them. Among the ones I've heard mentioned here: there's the LEI and the notion of trying to standardize identifiers, ontologies, and terms. And obviously there are lists of companies in different areas. There seems to be a lot of potential and existing infrastructure for bringing together the data. But in between all of that, one of the big challenges we have seen time and again is this: it's easy to think that because I can write a converter that brings data from one place to another, or have people define when two things are the same, the problem is solved. But inherently, a lot of the time when people encode data, they encode it ambiguously, or they encode it with a bunch of assumptions, because they've stored it for some purpose and you're trying to reuse it for a different purpose. Some examples I put up here: if I have a text document, say we're looking at press releases or news reports, even something as simple as "dollars" actually needs some context to figure out which currency, right? Sometimes people will say USD specifically, and other times they won't. Trying to enforce these standards at a global scale is really difficult. There are also issues of ambiguity: is "Morgan" JPMorgan Chase, is it Morgan Stanley, is it Mr. Morgan, Ms. Morgan, et cetera? So again, if I have data that wasn't designed for integration in the first place, it's very, very hard to bring it in. And in general, among the tools you build to collect all this data, some will be over new data, some, as Jag was saying, over legacy data, and some over open data, because it will turn out that whatever you were collecting, you didn't anticipate needing data from somewhere else in addition. So you're almost always going to have to keep adding other things on. And that's what really makes it difficult to integrate data overall. So what I want to do is talk a little bit about our experiences, not in integrating finance data, but in integrating science data, which has many similar characteristics. Over the past decade, I've worked in a number of different scientific domains trying to build general-purpose computer science tools that help scientists bring together their data. This has been in things like genomics, systems biology, neuroscience, diabetes, and so on. And basically every effort has started by saying: we're going to build a standard, okay?
And so you build a standard, and you implement a database system, and you build a whole bunch of converters, and you get people to map all the identifiers, and for your first ten or so data sources, this is good. You have total buy-in from the organizations; you get some uniformity. Then the eleventh comes on, and they were not part of the original team. Suddenly they have extra data that doesn't fit, right? Then you get some users, and the users start to realize that what you had in your standard wasn't enough to answer their question, so they want to bring in some other data. So we've constantly had this problem that the design target keeps changing and the scope of what you need to think about keeps changing. At that point you have to change your standard, but you've also already got lots of data that doesn't conform to it. That's the real challenge: you're constantly racing to catch your standard up to the reality of how the data is being used. The other big thing is uncertainty: dirty data, assumptions about how the data was collected, which populations it was collected from, which control groups in certain cases, and various things like that also play a factor in all of this. So even when you have the data, sometimes it's just not actually useful. Again, it's a very messy scenario. So one thing to take from this is that we should be careful not to be overly optimistic that we can perfectly solve these problems. That being said, I think we can do a lot to make solutions better. Over time, what we and a bunch of other research groups in computer science have started to think about is not the idea that I have raw data, I clean it up, and I put it into the central, perfect database, but rather that I always have data that's a little bit in flux. It started off raw, I integrated it some, I cleaned it some. Now I'm going to look at it more, and I might have to clean it again, or integrate it more, or pull things out. So really, think about it as a loop; the figure up here illustrates that. We're bringing data in from outside; we're also bringing in whatever ontologies, standards, et cetera, we have. I'm definitely in agreement that we want to use standards where we have them. We just have to acknowledge that they're a start, not a finish. Do your best to build tools that automatically extract entities, that link between mentions of terms, that use LEIs, whatever; then accept that your data still may not be perfect. As users start to query, they're going to get answers, and some of the answers won't make sense. At that point, what you want is techniques by which they can clean the data. Ideally, you learn from what they did to clean the data, and you loop it back around to improve your data as you go (a small sketch of that matching-plus-review loop follows). This basic approach has become one of the pathways toward what we think will lead to an overall solution. It's certainly proven successful in the early implementations that we have, and there are different pieces of it from, for instance, the IBM group whose work Sanjiv is going to talk about later; we have some efforts; and DeepDive from Stanford, and Michigan as well, have been doing things in this space. Broadly: combining learning, probability, uncertainty, and cleaning, and recognizing that it's not one step but an iterative process, has to be the pathway forward.
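As a minimal sketch of the kind of candidate entity matching with confidence scores that then gets vetted by a human (recall the "Morgan" ambiguity above): the canonical name list, scoring function, and thresholds here are illustrative stand-ins, not a production matcher.

```python
# A minimal sketch: score raw mentions against canonical entities,
# auto-link only confident, unambiguous matches, route the rest to review.
from difflib import SequenceMatcher

CANONICAL = ["JPMorgan Chase & Co", "Morgan Stanley", "Bank of America Corp"]

def candidates(mention: str, min_score: float = 0.5):
    """Score a raw mention against each canonical entity name."""
    scored = [(name, SequenceMatcher(None, mention.lower(), name.lower()).ratio())
              for name in CANONICAL]
    return sorted([c for c in scored if c[1] >= min_score], key=lambda c: -c[1])

for mention in ["J.P. Morgan", "Morgan", "BofA"]:
    matches = candidates(mention)
    # Auto-accept only a strong match that clearly beats the runner-up.
    if matches and matches[0][1] > 0.8 and \
            (len(matches) == 1 or matches[0][1] - matches[1][1] > 0.2):
        print(mention, "-> auto-link:", matches[0])
    else:
        print(mention, "-> ambiguous, route to human review:", matches)
```

Note that "BofA" gets no candidate at all: pure string similarity misses abbreviations, which is exactly why the loop needs human feedback that can be learned from.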
Okay, so just to give you some idea of the still-open areas where we are making research progress: in many ways the first step is that we have to be able to bring in data from outside. Just to illustrate, one of the approaches we tend to take is: you've got a bunch of data, and we're going to capture it. So here's an example with some concepts and relationships and some data, and we show links between them when they're related. Some new data source arrives, a new report off the web; something gets filed or crawled or extracted. One of our approaches is to get some kind of domain-specific extractor and pull out whatever we can from it. So there's a piece there. And then the generic platform will go and look at a library of tools, algorithms, et cetera, that know about the specifics of finance and distributions of terms and so on. Once this new data is added into our repository, it will try to automatically classify it: for instance, this report might be about Citibank. And perhaps client 001 here might be mentioning a client that I know about. So it's really trying to automatically discover these links. One of the critical things is that these are all based on probability, approximation, similarity; you're not 100% confident. From this graph of relationships, one can build query tools where the user asks questions and gets back results that stitch together different components. Now the key is that the user can actually look at these results, perhaps do some extra work to see if they make sense, and realize that one is actually a mistake. The goal is then that the computational system will look at which algorithm told me that this was a link. Perhaps algorithm number one said these two things were related, and it was wrong. What we want to do is use machine learning to realize that everywhere else linker number one gave us something, perhaps we should be a little suspicious about it. So maybe we also remove the mistakes made by this particular algorithm. The whole process is designed to iterate, to build trust and confidence in some clues while learning that other clues are not to be believed as much. It's really trying to combine pluggable, domain-specific tools with general-purpose computer science components, so that you can leverage lots of common machinery but then specialize the domain expertise to the problem you're trying to solve. The last thing I want to mention is the glue between everything. We've talked about this in bits and pieces through the workshop: there's this term, data provenance, which means tracking, for every data item, where it came from, how it came about, and so on. Data provenance is in a way the glue behind all of this. Every time I deduce some fact, every time I acquire some data, having a record of where it came from, why, and how ultimately means I can look at all the other data from the same source and decide: if this one was trustworthy, the others might be, et cetera. So I can use this to determine how much to believe things. I can also use provenance to decide whether all these data sets are even compatible, because I can go back and see that this data item came from this source and that data item came from that source, and if they were collected under incompatible assumptions, I can remove them.
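A minimal sketch of what such per-fact provenance records might look like, and of how flagging one bad link can propagate suspicion to everything the same algorithm produced; all field names, facts, and scores here are illustrative.

```python
# A minimal sketch of per-fact provenance: every derived fact records its
# source, the algorithm that produced it, and a history of transformations.
from dataclasses import dataclass, field

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str          # where the raw data came from
    derived_by: str      # which extractor/linker produced it
    confidence: float
    history: list = field(default_factory=list)  # chain of transformations

facts = [
    Fact("report-17", "mentions", "Citibank", "web-crawl", "linker-1", 0.9),
    Fact("client-001", "same-as", "client-XYZ", "crm-db", "linker-1", 0.7),
    Fact("report-17", "filed-with", "SEC", "edgar", "parser-2", 0.99),
]

def penalize(facts, algorithm, factor=0.5):
    """A user flagged one of linker-1's links as wrong: discount
    everything else that same algorithm produced."""
    for f in facts:
        if f.derived_by == algorithm:
            f.history.append(f"confidence {f.confidence} scaled by {factor}")
            f.confidence *= factor

penalize(facts, "linker-1")
# Only sufficiently trusted facts survive into downstream queries;
# the history lets us reverse or re-examine any integration step.
trusted = [f for f in facts if f.confidence >= 0.5]
print([f"{f.subject} {f.predicate} {f.obj} ({f.confidence:.2f})" for f in trusted])
```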
Basically, provenance gives us a way of assessing data and also, in essence, of reversing integration steps that aren't actually meaningful for the queries I'm asking. All of these areas have a significant and growing body of research in the database, machine learning, and scientific data communities. They're all making progress. I think they're all also at a point where they still need further maturing. But I think they give us a lot of hope for building toward this vision of standardization without assuming that the only thing that exists is the standardized stuff; they give us a way of evolving past that. So what we're really after is a way of separating reusable tools from domain expertise, and putting those together. We have to acknowledge uncertainty from the beginning. And again, I think we have to acknowledge that any kind of data analytics also has a human in the loop to constantly clean, refine, and inform the system. Thank you very much.

So I'm going to give you a taste of the analytics portion from my own research. Machine learning is a broad field. I'm going to focus specifically on learning from streams of data that may change over time, because this may be applicable in finance. And despite the generous introduction by Jag, we have done one workshop paper on financial stability monitoring; mostly we do algorithm design. We also study climate informatics, which is trying to use machine learning to improve predictions of climate change. So we're in this big data setting. I've listed a variety of kinds of data being analyzed these days, including ones from finance, and these actually pose challenges to standard machine learning. Machine learning algorithms are the foundations of big data analytics. When we get into the big setting, that can mean that the data is vast: there's a lot of it. It can also mean that the data is very high dimensional, so we're looking at a lot of indicators, or really large panel data, for example, in the finance setting. The data can be noisy for a variety of reasons. It might be raw, meaning it may not be conveniently labeled for a particular classification or regression task. The data can be sparse, meaning that even though we're measuring many, many indicators, there's really some low-dimensional space containing the salient or relevant information. And I'm going to focus on data that is streaming over time, that may be time-varying, and that may also vary over other dimensions. When I study climate, the other dimension is typically space, but you can think of analogs, maybe many markets in different countries as your second dimension, for example. Increasingly, the data being analyzed is sensitive or private: it might be your social network data, it might be medical records in hospital databases. So the research program in my group is: how do we design algorithms that are principled with respect to computer science standards, in terms of efficiency and performance guarantees that we can prove, yet address these challenges of real data sources? Data streams are what I'm going to focus on today, but I've also looked at learning from unlabeled or partially labeled data, for which you want to do semi-supervised learning or interactive learning; Zach talked a little bit about interactive learning. And these days I'm working a lot on unsupervised learning: what can you do with no labeled feedback at all?
I've worked on privacy-preserving machine learning, which sounds like it may be of some interest to this group, but which I'm actually not going to speak about in these remaining eight minutes. And new applications of machine learning, which is partly why I was very interested to come here and learn, because I think financial stability monitoring is a very valuable area that we should go into. Mostly I've been working on building this field called climate informatics, where we want to shed light on climate change using machine learning. So what do I mean by streams? The data itself may be arriving in real time; you might need to make real-time decisions. The feedback we get from the user might arrive in a streaming fashion. Think of the task of your spam filter: that's a machine learning algorithm that's learning to classify spam versus not spam, but it only gets labels when you, as the user, happen to label a message as junk or not junk. Sometimes the data is not streaming, but due to resource constraints, say computing on a small device, we're going to access the data in a streaming fashion. In this setting, I have a particular focus on a family of algorithms called online learning with expert advice. And I want to encourage brainstorming here about future applications of these sorts of algorithms; you just need time series data. So an "expert" is a time series, where we view each observation in the time series as a prediction, but an expert, despite the name, need not be a good predictor. I've been doing this to combine the predictions of the IPCC models; this is the Intergovernmental Panel on Climate Change, which was awarded the Nobel Peace Prize in 2007 for its work on climate change. They've been scratching their heads about how to combine predictions from models from all over the world, models that make very different predictions even though they are physics-based models in development for 30, sometimes 40 years, these large Fortran artifacts of code. So we've been working with IPCC scientists. You could also take analogies to weather: you always hear about the weather models from different countries. How do we combine them in an ensemble in a way that makes sense? Brainstorming for the finance setting: the experts could be securities themselves. They could be analysts. They could be various attributes of securities. I'll talk at the end about an application to volatility prediction. Someone observing my talk at a previous finance workshop suggested that maybe you could use this sort of algorithm for GDP nowcasting, where you take the different inputs that are used in other GDP nowcasting products, and machine learning could then be used on top to combine these experts. So here's the framework. We get observations one at a time in a stream, and a set of these experts makes predictions at each time step. The algorithm, without looking at the true observation, observing expert predictions only, has to make some kind of combined prediction. Then the true observation is revealed, so that the algorithm can learn as much as possible about the quality of the current predictors. Finally, the algorithm can update its weights over each of the experts, and this is repeated over time as you receive observations in the stream. I'll just say briefly, without going into too many technical details, that when you have access to a set of experts, you can view these weights as a probability distribution.
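A minimal sketch of this prediction-with-expert-advice loop: multiplicative weight updates plus a Fixed-Share-style mixing step, one standard way to model non-stationarity. Note this is an assumption-laden simplification: the speaker's actual work learns the switching rate alpha online, whereas here it is hard-coded, and the squared-error loss, learning rate, and toy data are all illustrative.

```python
# A minimal sketch of online prediction with expert advice:
# exponential weights, with a Fixed-Share step for non-stationarity.
import numpy as np

def predict_with_experts(expert_preds, outcomes, eta=1.0, alpha=0.05):
    """expert_preds: (T, n) array, one prediction per expert per step.
    outcomes: (T,) true observations, revealed only after predicting."""
    T, n = expert_preds.shape
    w = np.full(n, 1.0 / n)          # beliefs over experts
    forecasts = np.empty(T)
    for t in range(T):
        forecasts[t] = w @ expert_preds[t]            # combined prediction
        loss = (expert_preds[t] - outcomes[t]) ** 2   # penalize poor experts
        w *= np.exp(-eta * loss)
        w /= w.sum()
        # Fixed-Share: mix a little mass back to every expert, so we can
        # switch quickly if the identity of the best expert changes.
        w = (1 - alpha) * w + alpha / n
    return forecasts

# Toy usage: expert 0 is best in the first regime, expert 1 in the second.
rng = np.random.default_rng(0)
truth = np.concatenate([np.zeros(50), np.ones(50)])
preds = np.column_stack([rng.normal(0, 0.1, 100), rng.normal(1, 0.1, 100)])
print(np.mean((predict_with_experts(preds, truth) - truth) ** 2))
```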
So these are your beliefs over which expert is the best performer. When you update weights in this template, you should do it with respect to some function of the true observation and the expert's prediction: you should penalize experts that perform poorly. So you design a loss function. Different techniques for updating this distribution then correspond to different underlying assumptions about how often you expect observations to change: your model of non-stationarity. This has been one thrust of research in my group, with applications to climate and, at the end, a new application to volatility prediction. So you're in some setting where you have some experts. Here I've drawn climate models contributing to the IPCC. Maybe we have weather models trying to predict a weather variable; these climate models are trying to predict temperature. You may have securities. You decide what your experts are. The standard practice at the IPCC for climate model prediction was just to take a weighted average where all weights are equal. If we play this game I was describing, where at the end of each time interval, be it daily, weekly, or monthly, we do get the true observation, then we can update the weights, penalizing experts who are performing poorly. Now, any technique for updating weights over experts in a non-stationary setting (and we do see that financial observations vary over time) has to manage a trade-off. Say model B, or expert B, had been performing well at first, but at the next iteration model A performed well. It'll take a while for model A's weight to contribute to the multi-model mean, because model B had accrued so much weight. For this trade-off we have a convenient term in the field of artificial intelligence: it's known as the explore versus exploit trade-off. Exploiting would be predicting with the currently best-predicting expert. Exploring would be maintaining a nimble attitude, being ready to quickly switch to other experts should observations change. And this trade-off boils down to: how often does the identity of the best expert switch? Within the field of machine learning, this was a past area I had studied, how to learn from non-stationary data, and we have an algorithm from a while back that learns exactly this. It learns the switching rate between best experts, which is indeed the level of non-stationarity in the stream of expert prediction losses. And this did really well in a climate application. So, up is bad in terms of prediction error. We're trying to predict global mean temperature anomaly. Red is the multi-model mean, which is the standard practice at the IPCC. This is in a future simulation; we also had a lot of experiments on historical data using hindcasts. But in a future simulation, where you clamp one of the IPCC models, pretend it's the truth, remove it from the training ensemble, and train from the remaining pool (and of course you have to run this a variety of times), there is some best expert in the pool, shown in green, but its identity can't be known beforehand. The learning algorithm, in black, does a pretty good job of tracking it. So this helped launch the field of climate informatics. And then you can think of looking at a second dimension. As I mentioned, in finance this might be multiple markets; in climate, this was the spatial dimension. The idea is that we now expect non-stationarity both over time and over the second dimension:
space, or in your case, markets in different countries. So we have various lightweight online techniques that share neighborhood influence, and this is just one hard-coded neighborhood diagram; you can choose how you want your neighborhood structure to look. You can also take a more computationally expensive Markov random field approach to such a problem. Both of these regionally based approaches perform better than our original global approach, with the Markov random field performing best; our distributed online version is a lightweight approximation to it. Now I'm going to shift to something that we have applied in finance (we haven't applied those previous algorithms). Here we're trying to exploit and learn from multi-resolution structure in time. We first developed this for combining seasonal climate model predictions. There you have prediction products that predict two weeks in advance, a month and a half in advance, two and a half months in advance, all the way out to eleven and a half months in advance. So at any time you have predictions at twelve different intervals into the future. You can do this with finance data as well. The idea is: can we transfer information between tasks? We'll treat the different prediction intervals as different learning tasks, and using a paradigm known as multi-task learning, learn these tasks simultaneously and share information between them to improve the prediction at any given time interval. The application was to understanding market volatility. Implied volatility predictions often have high variance, and we can consider having one implied volatility prediction per security, say, in the Dow index. How can we use machine learning to combine these? So our ensemble now has size 30. We're using implied volatility on each security to predict average actual volatility over the ensemble, with data from WRDS. As for the similarity matrix: in multi-task learning, you need to define similarity between your tasks. What we're saying is that for whatever period you want to predict, say 60 days in advance, we're going to always incorporate information from the shortest interval, 30 days, because it'll have the freshest information. And what we saw was that there is a parameter regulating the influence between tasks, but for every task, including some influence via this relationship improved the prediction for the individual forecast period beyond trying to predict it individually. For some tasks there were diminishing returns: we would get some improvement, and then as the parameter increased, the improvement would level off. That parameter could be tuned. But for all tasks, we saw improvement from using this multi-task technique. So that's pretty much all I wanted to say, and I'll take questions.

Thank you, Michael, for having me here, and thanks to everyone for organizing this conference; it's been fantastic so far. I even have Michigan colors on my slides. I've always used them, and this is the first time I'm actually here to present with the same colors. So what I'm going to do is talk a little bit about my philosophy of handling big data in finance. I'm a theorist, both a financial theorist and a computer science theorist. So I come from the theory side, and I shouldn't be at this conference, given that it's all about data.
But be that as it may, what I have found over a few years of academic experience is that making theory the basis for driving the way we use data is important, and it also improves the way we handle many of the data problems that have come up at this conference. In fact, data is not only the solution; it's also the problem. And I think theory is one way of putting structure on the data so as to get the right answers that you seek. So I'll go through a few examples to epitomize what I'm trying to say here, and then maybe we can comment on it as we go forward. I'll skip that. This is just a quick slide to give you a sense for why I like using theory when applying models to big data. The question is primary, in my opinion. If you don't define the question well enough, you'll be running around a lot of data, spinning your wheels. So that's important to start with. I've also found that simple theoretical models actually make data handling better; in fact, a clear and transparent model reduces the computation you have to do. So that's a goal: minimize the amount of wrangling we have to do with the data we have. We can minimize it by using dimension reduction, and dimension reduction can be theoretical. That is, you do a lot of the work in economic modeling before you go to the data, so that you reduce the number of variables you actually need. Financial markets are incredible aggregators. Just as we're talking about aggregating big data physically using computer systems, financial markets themselves are phenomenal aggregators. So the theoretical side of economics and finance gives you a set of well-founded theories that allow you to aggregate in theory first, before you get to the data, and so you minimize the amount of data you actually need to work with. That's an important design paradigm that I'd like to push. What we do is multidisciplinary now: finance people are working with psychologists, computer scientists, mathematicians, physicists. So we have to learn each other's disciplines as well, and theory is a foundation for that too. And finally, data is disparate; I think the rest of the panelists have spoken to that pretty eloquently. So with that basic introduction, I'll give you three examples to show how this might actually work in practice. I'll talk a little bit about measuring systemic risk and some models for that. I'll talk about some recent work on zero-revelation paradigms for using text to detect financial distress. And if there's time, I'll say a little bit about getting general metrics for market liquidity, also using large-scale data. The first one is an older project; it's something that was mentioned already on the panel. It's work with IBM, with the labs at Almaden. The group there, all of them listed at the bottom, did an incredible job of pulling data and integrating it from various media sources. That data is then used to drive a systemic risk tool that allows us to measure systemic risk for the U.S. financial system. That's the data model in brief. Basically, the data is all public data, so there's no PII in it; it's completely based on filings with the SEC and the FDIC.
All those filings are sitting in the ecosystem you see up on the screen. You don't have to get into it in any great detail, but there are a lot of different filings, and so that's a data aggregation problem and an integration problem as well. IBM's text mining group has a series of about nine papers and four patents that do this job really well. But the idea is very simple. You build your data around entities like events, companies, securities, and so on, and then you write text processing algorithms that go through the entire document, whatever filing it is, and figure out which pieces of that document attach to which data entity you see up there. So you catalog all the disparate data around these different data entities, like a security or a person and so on. Once that's done, you have your data model and your data integration taken care of in one shot, and then that can be used to do financial analysis. I'll just focus on one of these entities, which is the loan bucket there. In the loan bucket, the idea was to reconstruct all the lending taking place between entities. These are loans of up to 364 days between banks only, financial institutions. All of these are reported in a loan document uploaded to the SEC server. It's a 150-to-200-page document, just text, Word files, PDF files, no particular format. A text processing algorithm goes through the whole thing, pulls out everything you need to know about these loans, and stores them in a data set. And now we have a data set that allows us to look at the flows between banks in terms of this lending activity. Those flows can then be used to construct a network of how banks are interacting with each other in this lending environment. For 2005, that's a scaled-down picture of the network. We had never seen this picture before, until we pulled this data and text-mined it. That picture shows you that there are three clusters of banks all dealing with each other, and sitting in between these three clusters are the three big banks: Citigroup, JPMorgan, and Bank of America. Once I have a network like this, I can either analyze the properties of the network vis-a-vis the different entities in it, or I can throw extra data from other sources onto the network and build up to an aggregate score for the entire economy. That's the 2005 picture. I've also got the pictures for 2006 through 2009. You can see 2006 looks pretty similar. In 2007, it becomes one big clump. Then 2008 is sparse, because the crisis has hit and nobody's lending to each other anymore, so you're getting a sparse network. And in 2009, you're getting disparate but still sparse networks. So you get some idea of the dynamic flow of this network over time. And this is only the lending network; this is not trading in securities or CDS or swaps or anything else. Now, you can take this network and do quantitative scoring of it as well. That's just a brief summary of some of the scores you can pull out of it for each of the years. If you look at the column on the right there, there's a number called R. R is really a measure of how densely this hub-and-spoke network is concentrated in a few nodes. It's a simple number that actually comes from expander graph theory in computer science, but it's a number that tells you whether a system is fragile.
That is, whether a local effect will actually percolate and become a global effect. If that number is bigger than two, in general, you get percolation. The number here is in the hundreds, so the US financial system was pretty fragile, just looking at that one summary number. So that's the way we've taken a lot of textual data, put it into a network, and then done computation on the network to come up with a single number that aggregates what the fragility of the US financial system might be. And that's the idea: you start with some theory, build it up, and get to a single number that a regulator might be interested in. The number was 137 in 2005. It went up to 172 in 2006, and then started dropping as people started trading less with each other. You can compute a very simple thing that everybody in the room probably knows how to do, which is to compute who's the most central player in the network. It turns out it's JPMorgan, Bank of America, and Citigroup. But when you compute something like eigenvalue centrality, which you're seeing up there, that actually tells you the relative importance of all these players in the network, and that's important as well. So for example, if we thought the network described all flows in the financial system and we wanted to define SIFIs, we could get a rank ordering of centrality and say that the names at the top of this list generally tend to be SIFIs, whereas the others are not. So that's another practical application for all these banks and networks and so on. Now, if I wanted to take credit scoring data and add it to each node in this network, I could get the credit scores for all these banks in some form. The simple equation you see in the middle of the slide combines the network matrix E, which just tells you who's connected to whom in the network, with C, which is the collection of credit scores of all these banks. You put that into a formula, which I chose in that particular form because it has some nice mathematical properties, and I get a value S which tells me the systemic risk score for the entire US banking system. Now, that's a nice number, because the function I've used is linear homogeneous, which allows me to decompose that system-wide risk into each individual bank's contribution to it, and then you can plot those contributions as well. So now I know how much each bank is contributing, and I could use that, for example, as a way to levy a capital charge on a bank that is contributing much more to systemic risk than another bank. So once again, it's theory-driven, but there's a lot of data sitting behind it as well. I did this in India for the central bank; we implemented the entire system there, and that's just a sample picture of what the banking system in India looks like, with the bigger, more central banks in the middle. If I had the software running here to interact with, you could mouse over and it would give you nice properties of all these banks. You can see the fragility number in India is only 2.91, and you can compare that to the US numbers, which were in the hundreds. Okay, so there's a big difference. The systemic risk score S is also computed, and you can track it over time, see how it's evolving, and see which bank is contributing to it.
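The talk doesn't spell out the formula on the slide, so here is a minimal sketch under explicit assumptions: the forms in Das's published network-scoring work, S(C) = sqrt(C'EC), which is linear homogeneous and so decomposes via Euler's theorem, and a fragility measure R = E[d^2]/E[d], which matches the "percolation above two" rule just described. The toy adjacency matrix and credit scores are illustrative.

```python
# A minimal sketch, assuming S(C) = sqrt(C' E C) and R = mean(d^2)/mean(d);
# these forms are taken from Das's published work, not from the slide itself.
import numpy as np

E = np.array([[1, 1, 0],      # adjacency with self-links on the diagonal:
              [1, 1, 1],      # who is connected to whom
              [0, 1, 1]])
C = np.array([2.0, 5.0, 1.0]) # per-bank credit scores (illustrative)

S = np.sqrt(C @ E @ C)        # aggregate systemic risk score
grad = (E @ C) / S            # dS/dC_i
contrib = C * grad            # Euler's theorem: contributions sum to S
print("S =", S, " per-bank contributions:", contrib)

d = E.sum(axis=1) - 1         # node degrees (excluding self-links)
R = np.mean(d**2) / np.mean(d)  # fragility; > 2 suggests percolation
print("R =", R)

# Eigenvalue centrality via power iteration: relative importance of nodes.
x = np.ones(len(C))
for _ in range(100):
    x = E @ x
    x /= np.linalg.norm(x)
print("centrality:", x)
```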
We backfilled that to 2008, so the blue line up there on the screen is showing you the time series of the systemic risk score for India. It's nice because it's one single score giving you the level of financial risk in the system; it's like your homeland security score, you know, red, orange, green, and so on. So the central bank governor can track that, and if it suddenly jumps, you can go into the forensics and figure out which bank contributed to the sudden jump in systemic risk. So it's kind of useful. What you want to do next is not take static snapshots of the network; you want to make the asset values on those nodes stochastic, so you can make the entire network dynamic as well. You can have a completely stochastic network. One of the ways to drive that is by using theory. Merton's model from 1974 is a great model for this; almost everybody in finance knows the model, and it's used for a lot of different things. So you can do that and normalize, and there's another set of analytics that comes out of this. I built this into a small system that takes data on all the banks in real time and then allows me to plot the networks, and I can use sliders to say: if this linkage between the banks goes up by so much, what happens to systemic risk? How does the network look? What does the risk decomposition given by that formula look like? And so on and so forth. So this is just one way to use theory to drive a big data application in finance. Another application is analyzing email traffic. If you're a manager of a bank and you want to pick up signs of a large institution running into demise, you could analyze what people are saying in your email network. Big banks already do this: the head of IT in every bank is monitoring emails for cybersecurity risk. So they're already collecting a lot of this data. All we are trying to do is put on top of it an algorithm that will read the data for signs of financial distress. You can do many, many things with it. Now, the nice thing about this is that you don't have to actually read the emails, so there's no real privacy issue here. It's really a score that's generated from these emails, and nobody's reading them, so I think there's less sensitivity about an algorithm of this sort. It's a zero-revelation algorithm. I took the Enron email corpus, which everybody has access to, and said, okay, let me run a prototype on this and see what we get. So if you look at the Enron emails, the upper panel is for 2000 and the bottom panel is for 2001; this is a weekly plot, and the demise came in the latter part of 2001. So we can see what happened the year before and the year during the demise. You can see that the email frequency went up over time and then dropped off towards the end. The email length, the number of characters per email across all the top 150 managers of Enron emailing each other, remains pretty constant in 2000. We deduplicated by not counting a message twice across inboxes. But as they get into trouble, the emails get smaller and smaller and smaller. Okay? We can also compute sentiment scores; yesterday there was a discussion of sentiment. Now, of course, there are packages that will compute it in one function call; you don't have to write a lot of code to do this. So we computed sentiment.
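A minimal sketch of the one-call style of lexicon-based sentiment scoring being described; the three-word lexicon and the averaging are toy stand-ins for what the off-the-shelf packages actually do.

```python
# A minimal sketch of lexicon-based sentiment scoring; the lexicon is a toy.
POSITIVE = {"growth", "profit", "strong"}
NEGATIVE = {"loss", "default", "discredit"}

def sentiment(text: str) -> float:
    """Net sentiment in [-1, 1]: (pos - neg) / matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

emails = ["strong profit growth this quarter",
          "counterparty default and potential loss"]
weekly_score = sum(map(sentiment, emails)) / len(emails)
print(weekly_score)  # 0.0: one fully positive, one fully negative email
```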
The bottom line is sentiment; the upper one is the stock returns. You can see them tracking each other. And you want to know what drives the returns of Enron. If you regress the returns on sentiment and email length, email length actually becomes much more significant, and the sentiment in fact drops out. So maybe you just need to look at emails getting shorter, and that does it. But this is just exploratory data analysis using a fair amount of data; this is a fair number of emails as well. You can plot the network of emails. The one on the left is 2000 and the one on the right is 2001. The network on the right is much denser. In fact, it's plotted in such a way that the people who are not emailing much are on the left-hand side of the plot. It's like a thumbprint, and on the right-hand side, suddenly, when there's trouble during the year, you start getting all those people getting into the business as well. I've got a little tool that I built that takes all the emails and lets you pick any word you like from the English dictionary; if it's there in the email corpus, it'll show you how that word tracked over time. And you get some very interesting insights into which words start getting used as an institution gets into trouble. I just plotted the word "credit" here. I should have plotted the word "discredit", because that's an incredible plot: it's completely zero until the last quarter, and then suddenly it shows up and gets used a lot, and then drops off at the end of the year. So you can do that; you can do these plots as well. You can do topic analysis, which is now standard fare in text mining, where you basically do a principal-components-style analysis of the text rather than the standard PCA on numerical data, and you can plot these topics. So what you're doing is taking all the words in all the emails and trying to figure out which are the broad topics, which are the most important words in each topic, and how much of each weekly (in this case monthly) flow of email is in each topic. The blue stuff is positive-sentiment emails, and the dark greenish stuff at the end is negative-sentiment emails, so you can see the topics shifting from time to time as well. So that's another thing that happens. We did this in India as well. We pulled about 500 news stories a day; I was working with a start-up called Topycs, T-O-P-Y-C-S, and we were able to put on a map where each of these news feeds was coming from, then topic-analyze them and say what the most important topics were, the words in each topic, and whether the discussion in that topic was negative, neutral, or positive over time; and see how the topic flow was changing, whether suddenly there's a big spike in interest in inflation discussion versus growth discussion versus discussion about the currency, and so on, so that can drive policy going forward. Finally, liquidity is one of these topics that is discussed a lot in financial forums and not well understood, and we can use financial theory here, so I'm just going to put up one slide. We're extracting liquidity from all the ETF trading that takes place.
The nice thing about ETF trading is that there's the ETF's traded price and there's the net asset value of the ETF, and what happens is that if the ETF is in a market sector that's not that liquid, and something happens in that sector and people want to trade on it, they will trade the ETF and not trade the underlyings. And so a spread opens up between the ETF price and the net asset value; that spread widens because of illiquidity, and it can be measured in basis points. So once again, it's theory-driven, and you get a very simple formula that looks like that, and it can be run in real time, so again a little bit of a tool. In fact, if you want the tool, you can email me and I'll send it to you, and you can just put in whichever sector you want. So LQD is the liquid bond market ETF: you put it in, and it tells you what the current extra illiquidity is. So once again, it's pulling a lot of data, but it's solving the aggregation problem by using theory to come up with something that requires only a little bit of the data at a time, so we don't have a lot of compute to deal with either. Okay, so I'll stop there. I thought I'd just give you a few examples, but make the case for why theory is important. There are a lot of areas in fintech where I think this is going to pay off, detecting financial fraud, things of that sort, where a combination of good theory, text analytics, and good data integration will pay rich dividends. Thank you.

I'd like to invite questions now, and please remember to make sure you get the microphone before you speak. So I guess Michael first.

I've always been very excited by the network models that you've shown. I'm wondering if you could talk to us a little bit about how you might think about such a model when it's not about interconnections that lead to cascades, but about contagion. How would you go about thinking about building such a model?

So I think it's already partly built in here, because one of the things we try to do with the dynamic model is ask what happens to all the asset values of the banks if there's a systemic shock somewhere. Since each of the asset values in that model is a stochastic process and they all get hit at the same time, we can look at what the systemic risk score, or the distribution of the systemic risk score, would be under these different stochastic scenarios, and so we get a whole distribution. We can say that, with 5% probability, the systemic risk score will be under so much. That's one way: you say a tail event is equivalent to contagion, and you quantify it using the model. The other way to do it is to shock nodes individually, so that a shock to a particular node leads to another node having a crisis as well. That's also theory-driven, because in the current dynamic model the connections between nodes carry the conditional probability of default of the next node given that this node is affected, derived inside the Merton model. So we actually have that quantified from data, and that can then be used to shock a particular node, figure out exactly what the mathematical effect on the next node would be, and then let that percolate through the system.
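A minimal sketch of that node-shock percolation, run as the kind of Monte Carlo mentioned next. It assumes a matrix of pairwise conditional default probabilities; the values here are illustrative and are not derived from the Merton model as in the actual system.

```python
# A minimal sketch: shock one node, let failures spread through pairwise
# conditional default probabilities, and estimate per-node default rates.
import numpy as np

# P[i, j]: probability node j defaults given node i has defaulted (illustrative).
P = np.array([[0.0, 0.4, 0.1],
              [0.3, 0.0, 0.5],
              [0.1, 0.2, 0.0]])

def percolate(P, seed, n_sims=20_000, rng=None):
    """Monte Carlo: shock `seed`, spread failures until no new defaults."""
    rng = rng or np.random.default_rng(1)
    n = P.shape[0]
    total = np.zeros(n)
    for _ in range(n_sims):
        failed = np.zeros(n, dtype=bool)
        failed[seed] = True
        frontier = [seed]
        while frontier:
            nxt = []
            for i in frontier:
                hit = (rng.random(n) < P[i]) & ~failed
                failed |= hit
                nxt.extend(np.flatnonzero(hit))
            frontier = nxt
        total += failed
    return total / n_sims   # default probability per node, given the seed shock

print(percolate(P, seed=1))
# Quarantining a node would amount to zeroing out its row and column in P.
```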
Okay, so I'll stop there. I thought I'd just give you a few examples, but also make the case for why theory is important. There are a lot of areas in fintech, in different systems, detecting financial fraud, things of that sort, where I think a combination of good theory, text analytics, and good data integration will pay rich dividends. Thank you.

I'd like to invite questions now, and please remember to make sure you get the microphone before you speak. So, I guess, Michael.

I've always been very excited by the network models that you've shown. I'm wondering if you could talk to us a little bit about how you might think about such a model when it's not about interconnections that lead to cascades but about contagion. How would you go about thinking about building such a model?

So it's already partly built in here, I think, because one of the things we try to do with the dynamic model is ask what happens to all the asset values of the banks if there's a systemic shock somewhere. Since each of the asset values in that model is a stochastic process and they all get hit at the same time, we can look at what the systemic risk score, or the distribution of the systemic risk score, would be under these different stochastic scenarios, and so we get a whole distribution. We can say that with a 5% probability the systemic risk score will be under so much. That's one way: you just say a tail event is equivalent to contagion, and then we can quantify it using the model. The other way to do it is to shock nodes individually and ask whether a shock to a particular node will lead to another node having a crisis as well. That's also theory-driven, because in the current dynamic model the connections between nodes are conditional probabilities (if this node is affected, what is the conditional probability of default of the next node?) derived inside the Merton model. So we actually have that quantified from data, and it can then be used to shock a particular node, figure out exactly what the mathematical effect on the next node would be, and then let that percolate through the system.

You can run Monte Carlo simulations on these networks as well, and these are simple in some sense. You can extend the model, too: rather than have just the asset value and the probability of default of each node driving the entire model, which is what the current model does, we can throw in additional things on top of it. These simulations also let you figure out what you would need to do to quarantine a node in a contagion situation. You can say, okay, we're going to shock all the asset values down to the bottom quartile of effect, which is negative, I guess, and then ask which nodes I need to quarantine in this network to prevent the systemic risk from blowing up. Those are other things you can do. I'm trying to write some code, actually, that will let me mouse over a node, say I want to quarantine that one, and recompute. It's not a technologically difficult thing to do, but it would give a regulator an interactive, visual feel for playing with the network.

It's interesting: this interactive stuff, I'm finding, is really useful for regulation and for understanding what's happening with the data. You can't interact with dumb data; it sits there, it is what it is. But you can interact with a model that's running on the data, because then you can say, I'm going to take this aspect of the model and change it. It also makes it much harder to publish papers, because you can't put the interactions into a paper; it becomes a little difficult to describe. But I think it's much more interesting to do it this way.
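The simulation loop being described can be sketched in a few lines. The toy percolation model below is an illustration only: the conditional default probabilities are random placeholders, the "score" is just the expected number of defaults, and nothing here reproduces the panelist's actual Merton-based model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# cond[i, j]: placeholder for P(node j defaults | node i has defaulted)
cond = rng.uniform(0.0, 0.3, size=(n, n))
np.fill_diagonal(cond, 0.0)

def expected_defaults(shocked: int, quarantined: set, trials: int = 10_000) -> float:
    """Shock one node, percolate defaults along edges, average the damage."""
    total = 0
    for _ in range(trials):
        defaulted = {shocked} - quarantined
        frontier = set(defaulted)
        while frontier:
            nxt = set()
            for i in frontier:
                for j in range(n):
                    if j not in defaulted and j not in quarantined \
                            and rng.random() < cond[i, j]:
                        nxt.add(j)
            defaulted |= nxt
            frontier = nxt
        total += len(defaulted)
    return total / trials

print(expected_defaults(shocked=0, quarantined=set()))
print(expected_defaults(shocked=0, quarantined={2}))  # effect of quarantining node 2
```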
Hi, thank you for this fascinating panel. I have a rather prosaic question, though, and it may be best for Dr. Ives. For the system you were describing, which takes all these different types of extremely disparate data, big data, and sets up the necessary processes and systems: what does that look like in terms of the practicality of setting it up? In other words, is it really a matter of setting up the process and the governance, with the amount of manpower and human labor being rather minimal? Or is it that, because you're working in a domain of constant change, where you constantly have to help the standard catch up with changes and with users' perspectives and needs, there's actually a large number of people you need to keep up with that change? And a second question: is there a maximum bound? I suspect you have to scope which domain you're going to cover, and that the amount of change happening in that domain has a maximum, which determines the number of people you need working on it; at some point you say, all right, only 30 people need to be working on this domain given the amount of change happening in it. It's a bit esoteric, but it also comes down to the very practical question of how to set up such a system.

Right, okay. There are a lot of different levels here, so let me see if I can break it down. Fundamentally, what we've been after is trying to figure out how best to balance what we can automate against what requires human effort. I think it's very clear that information integration as a whole cannot be done completely by machine, partly because, if you look at the data, there are lots and lots of hidden assumptions that are not encoded in the data. A lot of the ambiguity I was talking about exists because people just didn't document their assumptions: when I see "dollar," it doesn't say which particular country's dollar. So a large part of it is just trying to understand those sorts of things. There are also mundane issues like file formats and so on; I see those as things somebody does once, and it takes a little bit of time, but that's not, in a sense, the hard problem. The hard problems are mostly of the form: I have a new hypothesis, I need to bring in new data to test it, so who does the work? What we've been trying to develop are techniques where that work can, in some sense, be scaled out to the community, so it's not just central people doing it; a lot of the work is done by the person asking the question or the person contributing the data. That's why we have this combination of machine learning and semi-supervised learning. There are various algorithms we can use to predict that these two names look like they ought to be the same person, or that these two fields probably mean the same thing, but a human will need to vet this at some point. What we've come to believe is that if you rely on a central set of administrators, they don't scale to all the questions in the entire community. The community, on the other hand, does scale, because it's the people who are asking the questions, and we rely on them to put in some effort to make sure they get quality results. In a sense this is in the spirit of crowdsourcing, where you try to figure out how to use the expertise of many humans, but in many ways it is closer to what's sometimes called expert sourcing: I have users who are asking questions, they're incentivized to get quality results, so they're willing to put in some effort, and they know something about the data and what they should be getting. If we can use their expertise to map, refine, and integrate the data, we can potentially use their effort to help the next person leverage that data, and so on. That's really at the core of it. There is constantly this balance between reusable things and special-purpose things, and I think one of the active research areas is how far we can push the automation of the process, removing as much as possible of the human's effort without adding a lot more mistakes or hidden assumptions.
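As a bare-bones illustration of that propose-then-vet step, the sketch below generates candidate name matches for human review. Here difflib merely stands in for the real learned similarity models, and the names and threshold are invented:

```python
from difflib import SequenceMatcher
from itertools import product

# Invented entity names from two hypothetical sources.
left = ["Intl Business Machines", "J.P. Morgan Chase"]
right = ["International Business Machines Corp", "JPMorgan Chase & Co"]

candidates = []
for a, b in product(left, right):
    # Cheap string similarity as a stand-in for a trained matcher.
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score > 0.6:                      # threshold is arbitrary
        candidates.append((score, a, b))

# The machine proposes; a person with domain expertise disposes.
for score, a, b in sorted(candidates, reverse=True):
    print(f"{score:.2f}  {a!r} ~ {b!r}  -> queue for human review")
```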
Yeah, I just wanted to say something about that. The initiative I mentioned, this thing called the Financial Industry Business Ontology, is in fact a process that is in some sense much like that. It's not right to think of it as an existing standard that just sits in space; it is in fact a process. It's got governance rules; it's an institution. And trust me, the data people on the finance side understand that their world is constantly changing, and that no system that isn't designed to evolve is ultimately going to work. So part of what's exciting to me about that particular initiative is that it is all of that, and to a certain extent it's very much related to this, in the sense that there are groups of experts who work in particular areas, and when new things come along, they do a draft and then take it out to the institutions to check before it's ever incorporated into the standard. It's robust. That whole initiative has many of those things built into it; it is really the process of how you ensure that you've got a standard that is living, in a way that responds to change.

The other point I want to make is that in some sense the problem in finance is easier than other problems. At the core of finance you've got contracts. The basic unit we're talking about here is a legal document; it is concrete. Any of us who have been involved with the OFR probably have our own personal favorite Mark Flood paper. My personal favorite is the one where he talks about an evolution in law toward writing financial contracts as code, precisely to get around the ambiguity of enforcement and all those things. The point is, we are talking about a universe where that's possible; that's not crazy in finance, and I think that does mean the core of the problem is in some sense easier. Now, obviously there are lots of ways in which you want to go beyond those contracts to bring in other information, and I totally get that you want to be able to do that too. But it's worth remembering that the core problem we're dealing with is actually pretty concrete; it has a degree of structure that allows you to go further down this road than you might in other spheres.

There was a question in the front here; get the microphone to the front, please. Right here.

When it comes to the legal entity identifier, specifically with mergers or acquisitions or splits of companies, how useful is it to convert legacy data? My intuition is that when there's a merger or a split, we're introducing discontinuities, because at least one of the entities involved becomes a different institution, or becomes subject to different policies and different management. So, to use Professor Montelloni's terms, we will end up penalizing the experts from the disappearing institution. Has this been assessed empirically?

I don't know the exact answer to that particular question, but as somebody who was involved in the early stages of the LEI initiative, let me respond. It was never designed to solve every problem; it was designed to solve the first problem. We started with a world where you did not have a way of consistently distinguishing the legal parties to individual contracts, and it began simply as a way of saying, let's make sure we can consistently track the particular legal entities that are involved in contracts. Now, as to mergers: if we're talking about mergers, then there is an answer to the question of who is the legal party to the contract, and it comes from the merger itself. If you're talking about a split, it's much the same thing; you may have to create a new number for the new entity that is created. And, as was noted in the introduction to this panel,
just being able to identify the individual legal entities does not deal with the problem of who owns what. There are all sorts of things that have to be built on top of it for it to be a fully articulated system, but the point is that unless you have that first step, you can't build those hierarchies. So in some sense your question is really about how you take that foundation and make it as fully useful as it might be, and again, I don't know the exact answer. But Matt Reed could stick his hand up and answer the question, if you'd like.

The rules of the system are that the organic rules of the home jurisdiction determine whether the surviving entity is indeed the previous entity, simply with a new subordinate entity, or an entirely new entity. If it's the former, you retire the code for the target of the acquisition, and the surviving entity maintains its code. If a new entity gets created, it gets a new code, but there's a pointer back and forth. That's the easy answer; it's in the details that it becomes challenging.

Just to add to that point: there is a specific method for managing entities, as Matt just described. But when you then have to do analysis on top of a data set where entities come and go, even though you know the rules by which they come and go, actually folding that into the analysis is not straightforward. If you end up drawing a simple visual of any quantitative metric, you have to understand what you're doing and why you're doing whatever it is that you're doing.
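Those succession rules can be read as a small piece of bookkeeping. A toy sketch, with invented field names, of retiring a code and keeping the pointers back and forth (this is not the actual LEI system's data model):

```python
# lei -> {"status": ..., "successor": ..., "predecessors": [...]}
records: dict = {}

def register(lei: str) -> None:
    records[lei] = {"status": "active", "successor": None, "predecessors": []}

def absorb(target: str, survivor: str) -> None:
    """Survivor remains the same legal entity; the target's code is retired."""
    records[target]["status"] = "retired"
    records[target]["successor"] = survivor       # pointer forward
    records[survivor]["predecessors"].append(target)  # pointer back

def merge_into_new(old: list, new: str) -> None:
    """An entirely new entity is created by the merger; it gets a new code."""
    register(new)
    for lei in old:
        absorb(lei, new)

register("LEI-AAA"); register("LEI-BBB")
merge_into_new(["LEI-AAA", "LEI-BBB"], "LEI-CCC")
print(records["LEI-AAA"])   # retired, successor LEI-CCC
```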
I'm Oliver Goodenough; I'm at Vermont Law School, also connected with the OFR, and I'm a co-author of perhaps that paper. Two things. If we're correct that legal specification, whether contractual, regulatory, or legislative, is essentially itself a computational exercise, that we are specifying a process and all the things that computation entails, then one of the things computation asks us to do is cut all the things that could happen in the world down to the things which are salient to the computation. So in effect we're turning the exercise you're describing inside out: instead of a mass collecting exercise, we've got a mass simplification and filtering exercise. And I was particularly struck, Sanjeev, by your saying, here are some numbers which we can just collapse down and they're really good predictors. That's what I want to write into my contract, and probably into my regulation, if I'm trying to have a relatively discrete set of computations going on. Are there techniques that you would think of that would be computationally useful? Is this a recognized problem, how to reduce your input alphabet, as it were, how to reduce your event space to the things which really are salient? Because in effect that is what I'm going to ask for if I'm writing a good contract, and probably if I'm writing a good regulation as well.

I was going to introduce a note of caution here. I have an interest in history, as I indicated earlier, and I honestly think that when you're thinking about financial stability, one of the things you have to do is think about how these issues have evolved over time, and the challenge in some ways is just the sheer pace of change. To a certain extent, when you start talking about simplifying in that kind of way, it's important to recognize how dynamic this is. So I would just say, I totally get the point that well-designed summary statistics are useful things, but it gets complicated when you're dealing with something that is evolving as rapidly as finance. There's been a lot of interesting thought on the question of how you effectively predict systemic crises, and one version of it is essentially based on asset prices; there's an awful lot of theory coming from asset pricing that one would think ought to be informative on that question. The problem is that when you look at financial crises, what often precedes them is a period of irrational exuberance, and if you look at how risk was priced in the run-up to those crises, it doesn't give you a lot of confidence that prices were sending the right signals. Now, as I've understood that literature, one thing it can do is make distinctions. For example, the work I've seen on the run-up to the last crisis did point to the particular institutions that were more likely to be problems than others. But if you ask the slightly different question, namely whether those indicators were telling you in 2005-2006 that you were on the brink of something really ugly, the answer is no, for the most part, and that's generally reflected in how risk was priced in the run-up. So I get a bit queasy about this. I do think it's very useful, and from the way I think about the data, the question of what really matters for the contracts is the central question; that, in some sense, is the set of things you have to capture in your data system.

I'd like to add to that. In many contracts there will, of necessity, be things you have to define as events requiring human judgment. If you have a contract with a clause about an act of God, you're not going to run a set of numbers to determine whether what happened was an act of God or not. But if you have a few things like this that you've said are independently determined outside the system, as values that you then input to your computational model, that opens the door to the complexity we need.

Right. Just to respond to the question about a single number: whenever you produce numbers of this sort, they fall into a class of metrics, and if you define that class properly, with some required properties, then every time you define a metric that meets those properties, you can at least be confident on that score. For example, take value at risk. Everybody uses it, and there are four properties that, over time, the finance community has decided a good risk measure should have; value at risk actually violates one of them. But we are aware of that, and we can take it into account. So there is a meta-theoretical level at which we can say what a good metric should have, even if it's a single number, and then you look at whether anything you produce falls into that category. I completely agree with Louis that one number is not going to cut it; you need a bunch of these.
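For readers wondering which property fails: the four conditions referred to are, to my understanding, the coherence axioms of Artzner et al., and the one value at risk violates is subadditivity, meaning the VaR of a combined portfolio can exceed the sum of the individual VaRs. A small numeric check of this, a standard textbook construction rather than anything from the panel, with two independent loans each losing 100 with probability 4%:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated losses: 100 with probability 0.04, else 0, for each loan.
a = np.where(rng.random(1_000_000) < 0.04, 100.0, 0.0)
b = np.where(rng.random(1_000_000) < 0.04, 100.0, 0.0)

var95 = lambda losses: np.quantile(losses, 0.95)   # 95% value at risk

# Each loan alone: P(loss > 0) = 4% < 5%, so VaR is 0.
# Combined: P(loss > 0) = 1 - 0.96**2 = 7.84% > 5%, so VaR jumps to 100.
print(var95(a), var95(b), var95(a + b))   # 0.0, 0.0, 100.0
```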
The other thing is, if you look at data from the financial markets, say correlation matrices, and you see them tightening, it's probably already too late; that's the other problem. So you want to look at the physical processes underneath, but the difficulty is actually getting at the mechanical structure changing before you see it showing up in prices and risk numbers. That might be a way to get at it; we don't know for sure, and we have to do more work to figure out whether it's predictive enough.

Can I add one comment on that? It's crucial that you have all the links. If I think back to the crisis, one of the crucial pieces we were missing was AIG's exposure to the mortgage universe, and part of the problem is you don't have that in those numbers. My idealistic dream about the number we actually need, which is in regulation but not yet implemented, is CVA, the credit valuation adjustment for counterparty risk, because that tells you exactly what your exposure is to every other counterparty, across all the markets in which you deal with them and all the contracts you have with them. If we have that network, that is what you need.

We're out of time, but I know you have a comment; if it is related specifically to what we're talking about, I will take it.

Yes. It seems to me there's an inherent friction here between raw data, which is not clean and doesn't fit your models, and a simplified structure in which everything is neat: here's an observation, and I know what it is. The challenge to me is, whose responsibility is that? You were talking about mergers; anybody who's actually looked at mergers knows there is no definition of a merger that would be commonly agreed upon. The FDIC and the Fed have two different taxonomies as to which bank succeeded which, and they're simply not the same. Or take what counts as a commercial bank: the FDIC and the Fed have different views. Those are fundamental questions; there just isn't a clean answer on these things. And I think the challenge here is that many researchers, maybe most, are irresponsible about this: they don't want to deal with that sort of stuff, they say, just give me clean data, you solve all of those problems for me. And yet, as Chairman Greenspan used to complain, there's an incentive in central banks for the staff to sanitize everything, because they want one story. Anomalies are just trouble for you; you've got to try to explain them, so you suppress the anomaly. He used to say there is no information in the consensus; it's the anomaly that contains the information. So what's the role of the data person, and what's the role of the user?

That's a great question, and I don't know that I can fully answer it. I think this is one of the things we are always wrestling with as a community when we're looking at data: the act of cleaning the data might actually be biasing the data in some way. As for the question of who's accountable, at the end of the day it ought to be that the community holds the person who's computing the result responsible for ensuring that it's valid. That's what we would like to be the case; it's not at all clear to me that it is the case today, and I think this is one of the open questions. I'll also say, from our work with scientists, there are a ton of cases where what's clean and what's relevant is extremely tied to what question is being asked, so the notion of trying to make clean data agnostic to the question is itself almost flawed. That's part of where we have to make some progress: really starting to understand these kinds of issues. And at the end of the day, I think it's the person asking the question who will have to have the expertise to figure out what's actually meaningful.

So with that, we need to bring this panel to a close. Thank you all for your attention.