Hello. Hi. I'm just going to give it another minute or two for people to join, and then I'll kick off. Okay, Dave, are we all set to go? Yes, we're good to go. Okay, great.

Well, hello, I'm Tosha Ellison from FINOS, and I'd like to thank you all for joining us this morning at our first mini summit. I hope you've enjoyed the conference so far and that you find these next sessions interesting. In my next very brief ten minutes, I'm going to give you a little bit of an introduction to FINOS, as well as a very high-level overview of the financial services market and why we think there's such a great opportunity for collaboration through open source in the industry.

As I mentioned, we're FINOS, the FinTech Open Source Foundation, and we are the financial services vertical within the Linux Foundation. Our aim is to help the financial services industry leverage open source software, standards, services and policies to solve industry challenges together, and really to innovate as well. We are a membership organization, and our 36 corporate and associate members really do represent the community and the different participants in the financial services industry. We have large banks, large investment banks, and technology firms both small and large who really know financial services and open source. And we have associate members — other nonprofits, government organizations and academic institutions — whose aims for the industry are aligned with ours. We have 37 active and incubating open source projects, standards and special interest groups (which are new for our foundation), and a growing number of contributors focused on our FINOS projects.

It's a little bit harder than you might think to actually define financial services. It's a term that can be a little bit vague and quite encompassing of lots of different areas. So I've broken it down here at a very high level into banking, investment management and insurance. Insurance is sometimes included when people talk about financial services and sometimes not. Currently it isn't one of our focus areas, although certainly a lot of our projects and the work that we do are applicable to that industry as well. Our core focus at our inception was around investment banking, but we are certainly seeing that broaden as we become a more mature foundation. So in the examples I give in the next few slides, I'm going to focus on investment banking, which is one part of banking — there's also retail banking and commercial banking — and in each one of these categories you can drill deeper and deeper into specific areas and services. I'm going to focus on investment banking because that's my background; it's what I know. And certainly we do have a lot of projects that have come from that area of the bank.

One other thing to mention before I move on is that this ecosystem gets very complicated. Different banks will refer to these different areas with different terms and names, and they might cover parts of banking and parts of investment management — there can really be a mix of the services that they provide to their end users. Within all of this you have branches and subsidiaries and counterparties.
And so sometimes even doing something that seems quite simple, like a single view of your client, can be complicated. It can be complicated because you can have silos within the different areas, and even looking across areas there's a large amount of data to manage. Which leads me to a little bit more about the volumes and the size and breadth of the industry.

Again, it can be difficult to actually define the value of the global financial services market, but one report puts it at $22 trillion. So that's pretty big. A few more stats: ISDA, who you'll hear from later, puts the gross market value of OTC derivatives at $11.6 trillion. The BIS, the Bank for International Settlements, reports that trading in FX markets in a single day last year reached $6.6 trillion. That's in one day. And also last year the CBOE, the largest options exchange in the US, had an average of 7.2 million options contracts traded every day. And these are just a few data points — there are many, many more transactions and functions that happen, including payments on the retail side, which really is a huge market. And it's critical financial infrastructure; it's recognized by governments and policymakers as a fundamental part of our economy and of keeping nations running.

If we look at one of the biggest players in this financial services ecosystem, it's banks. Just in the US, the European Union and the UK there are more than 11,000 banks — and I'm going to use the term a little bit broadly, because in the European Union they refer to them as credit institutions, and for the UK I've taken the PRA-regulated banks, building societies and credit unions to put in that number. Those 11,000 banks employ more than 10 million people, again just in those three markets, although they are reasonably big markets.

Then, to bring it down to a slightly more specific scale — something that may be a little bit easier to grasp — we can take the example of one large bank that does investment banking, corporate banking and retail banking as well. They employ 250,000 people globally, with 50,000 of those people in technology. And that technology division had a budget last year of almost 11 billion dollars. So although they're not technology companies, they're a little bit technology companies. Banks have always understood that technology is important, but with recent changes there's certainly more they have to do to take advantage of new technology, and that's clearly recognized by that level of budget.

So what are those 50,000 technology professionals in this one institution working on? Well, quite a lot of things. In a bank you can have anywhere from 4,000 to 7,000 applications. Those applications can be developed in house — so there's a lot of proprietary software — or be supported and maintained vendor software, and open source actually underpins a lot of those processes and applications as well, across those thousands of applications within a single organization. Some of those systems are meeting very specific business needs — maybe it's for the FX trading desk, or the salespeople, or for IPOs — there's really a very broad range.
And then there are of course the layers underneath that: legal and compliance, how you do your KYC and your AML, your risk management, your financial control, your regulatory reporting. And if you take it another step down there's common infrastructure that goes across the board, and things that you might not think about — corporate real estate has applications and software they need to help manage the spaces that the bank uses. So there are really lots of different areas and functions that these systems serve. And I have to move along quickly now.

So what does this all mean? You have a big, complex industry that provides essential services. There are thousands of technologists working on lots of systems, many of which are solving the same business and technical problems. And what that means is huge opportunities for open source collaboration. You can see some of the ways the financial services industry is collaborating in our landscape, which shows you all of our different projects, standards and other initiatives. And today you're going to get to hear about a few specific ones. That's what will be happening through the rest of the presentation, and at this point I'll hand over to my colleague James McLeod, who's our director of community and who can talk a little bit about some of the challenges you face as you move from working internally in a team and a project to an open source environment. So with that, James.

Thank you very much, Tosha, and thank you everybody for being here this afternoon at the FINOS mini summit as part of OSSEU. I'm James McLeod, director of community, and this afternoon I'm going to take 20 minutes of your time to talk through the challenges and opportunities of building open source communities in a highly regulated financial services industry. Thank you.

Looking through the people who are actually here with us this afternoon, I can see that we have many engineers, and we also have people who are very familiar with open source. So it's going to be very easy to communicate that we are living in an instant world of open source technology, where there are products, solutions and methodologies of working that allow us to develop software very fast, in a very agile and fast-paced environment. However, within financial services it hasn't always been like that. From my background as a financial services engineering lead, I know that there are many teams out there who have been working in a very linear way — following almost an industrial-revolution, production-line methodology of development: taking your stories or your features and passing them down the production line all the way through to production, only to find that there's an issue there, and then swinging that particular feature all the way back to the beginning.

However, teams are actually moving forward from that now, and I'm really pleased that digital transformation and the move towards agile and DevOps have taken us to a place where teams are moving faster. We've got feature teams within banks who are sharing stories and working together to remove all of the blockers that slow us down from day to day. So we are starting to deliver our features to our banking customers more quickly.
But even as we move from corporate goals through the production teams — all of the agile ceremonies we go through, all of the various DevOps practices, passing those containers through test, UAT and then live into production — we are still finding engineering and team silos forming, even around the processes that speed those teams up. So even though we're getting prioritized features coming into teams, those same teams are also dealing with defects and changes coming in from product owners and other people, and quite often those teams are also running the infrastructure, whether that's cloud or some form of mainframe. You do find those teams are still very much focused on their specific tasks and their specific projects, which means we still have duplication of effort running between teams. And although we're communicating very well within our teams, communication outside of our teams can be limited. Quite often, as we're looking at items to deliver to our customers, which are very feature and story driven, we just want to get the job done, and so we go through a lot of reinventing and relearning of the wheel.

From an industry point of view, if you take that on a global scale and multiply it across all of the various financial services firms and companies we have across the globe, all of those inefficiencies get multiplied within all of the various banks — whether retail, whether investment, maybe even FinTech firms finding that they're creating inefficiencies as they start to get bigger. The problem of how we go from that linear process of development into DevOps, and then create efficiencies out of those remaining inefficiencies, still needs to be resolved.

So how do we actually do that? Believe it or not, a lot of these problems are solved by connecting people and teams together. The connections between people and teams remove the silos that lead to software development inefficiencies. A lot of the processes and arrows and technical documents you write look like they're joining things up, but it actually comes down to the human side of connecting people. Now, in order to do that, it's actually relatively easy. You do need to bring your banking stakeholders along with you, because when you're changing things inside an existing structure you need to make sure that the people who can advocate for that change are on your side. However, these don't have to be massive corporate changes; it can be within a couple of teams who sit opposite each other. Ways in which you can do this are as simple as getting people together for lunch and learns, which you can do virtually, like we're doing here today. You might get together as a scrum team every day to talk about your problems — but can you actually visit another team and eavesdrop on the problems that they're having, and then speak to them about how you can help them solve their problems as well?
Then we also have the wider open source project meetings, and open source repositories across GitHub, GitLab and other places where technologists and engineers hang out. And there are online wikis and forums that you can take part in as well. That's bridging the gap through communication — that's not even sharing code yet.

Once you become familiar with that, there are ways to bring those communities together and push the boundaries a little further, but from within the boundaries of your corporate firewalls, so you don't have to venture outside your walls in order to do this. That's called inner source. Inner source enables internal team collaboration within the safety of your corporate firewalls. This is where your infrastructure teams and your various stakeholders will need to get involved, because this is sharing code: bringing a system like GitHub inside the corporate firewall and then opening up repositories and shared libraries for people to start collaborating and solving problems together. So inner source, and GitHub within the corporate firewall, are example mechanisms that enable internal collaboration — as well as attending meetups and getting involved in the very people-to-people aspect of growing your communities, sharing your learnings and helping each other.

Once you're familiar with the internal workings of what you need to do to collaborate as a team, open source communities develop and improve software through code contributions, idea sharing, fixing software, writing documentation and continuous education — but this is on the outside of your corporate firewalls. This is what we're actually doing today: being at OSSEU, we're breaking outside of our corporate boundaries and coming together as a big, joined-up community to start developing, sharing ideas and learning together outside of our corporate firewalls.

And as you can see, this is where FINOS comes in. This is where a lot of the advantages of being part of an open source foundation — one that's here to serve the open source community for financial services — come into play. We're a very diverse and rich community of people from across the financial services landscape. We have technologists who represent and develop a lot of the DevOps tools we use to speed ourselves up as engineering teams. Plus we have technology companies who bring an external perspective on how to solve engineering problems into the FINOS community — an open source community that educates each other, across the globe and across this diverse and rich community, on how to do things differently, how to solve those problems, and how to bring that problem-solving capability together. And then also: are there any projects that have been inner-sourced within your financial services organization, or your technology company, that would also be of benefit to other people on the outside of your corporate firewall?
Take Waltz, for instance, which was contributed into FINOS: an example of an open source project that was developed by a Deutsche Bank team on the inside of Deutsche Bank, and that allows engineering architects and other people in the technology area to log and communicate the topology of their internal architectures. The Deutsche Bank team — and this is just one example from a FINOS member — thought it would be a great idea to take Waltz and contribute it into open source. And so they did, and they brought it into FINOS. Then we also have NatWest Markets, who aren't a FINOS member, who are using Waltz on the inside of NatWest to architect and log the topology of their internal infrastructure, using an Apache 2.0-licensed open source project that is continuously being updated and modified between NatWest Markets, the Waltz team and other people who are using it. Plus we also have special interest groups that come together to share ideas, and project teams who are using GitHub and GitHub issues to communicate problems, ideas and features across the global landscape of the open source community.

I'm from the UK and Tosha is a colleague of mine in the US, and this is how FINOS actually works: it bridges the gaps between teams, both within the same firm and across multiple firms within the FINOS membership. FINOS supports and grows a safe and trusted open source community for the financial services industry across the financial services landscape. The FINOS community also bridges engineering team silos by solving common financial services technology problems from a collaborative, open source vantage point. So where inner source looks at the problems you're trying to solve from the inside of your bank, FINOS and the FINOS open source community do that with a global reach.

We have two projects that we run within FINOS as part of our open source community that not only bring people together to talk about their problems and help solve them — as we do within our special interest groups and as project teams — but also join things together through all of the various mechanisms of CI/CD that people will be familiar with. The first is our Open Developer Platform, which educates financial services engineers on how to be an open source engineer using safe and controlled CI/CD development. This is where the community aspect of open source, the learning, and the people-to-people bridging of gaps — this is where the rubber hits the road, and you bring all of those communication skills together in the Open Developer Platform to create that CI/CD: how do we take an idea from conception through to delivery, from writing feature stories and having project meetings, to being able to build our various projects, right through to how we document within the project repository and then build a microsite for other people to follow and to read our project documentation. ODP brings together all of the technology you need to be a successful engineering team with all of the communication you need as a project team as well.
We know within financial services that we need to be careful, we need to be safe, we are regulated — we have the watchful eye of the regulators on us. So we also provide a project called Open Source Readiness, which bridges the gap between how we engineer and deliver features and the governance people inside the banks, who need to be confident that we're actually doing open source correctly. Even people who haven't got any experience of open source can join Open Source Readiness, because we bring speakers from across the industry into it to tell their stories of how open source has benefited them and how open source can also benefit you, and all of the various checkboxes you need to go through to take your firm from not being an open source firm through to being one. Plus we also talk about inner source, because we understand that collaboration between people on the inside of the bank is just as valuable as collaboration between people and technology on the outside of the bank. So FINOS, being the FinTech Open Source Foundation, provides a safe place to share code and host repositories, plus all of the various tools, mechanisms and people who have the experience to do this too.

Now, once teams start joining FINOS — once you've crossed the threshold of joining that first call, or going to github.com/finos and finding our open source organization on GitHub — you'll notice that we have a lot of valuable projects within our repositories that are there for the benefit of everybody within financial services. And also for the benefit of people who want to break into financial services, because the FINOS community helps people who want to become a financial services engineer but haven't got that experience and would like to join a team.

Without focusing too much on our projects, I've just put a few examples up on screen now. If you go to the FINOS organization on GitHub, you'll notice that Goldman Sachs has just contributed Legend. We also have Perspective — for any JavaScript engineers, or anybody who builds WebAssembly, you can go in there and see how charts and graphs are built by the JP Morgan team. We have Cloud Service Certification, which we're putting a lot of energy into in order to create regulatory-compliant infrastructure as code across the main cloud infrastructures, Azure and AWS, and we've got people from within banks and people from technology companies helping us with that. Plus we also have a very active and successful standard called FDC3, which is about passing context across the trading desktop and how you actually do that as financial objects in code — there's a small illustrative sketch of what that looks like below. So if you want to join a standard, FDC3 is absolutely a community you should get involved in; it's very active and it's very popular.
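To make that a little more concrete, here is a minimal, hypothetical sketch of the kind of context sharing FDC3 standardizes. It assumes an FDC3-enabled desktop container that injects a global `fdc3` API into each app; the exact API surface depends on the FDC3 version and the container you run, so treat this as an illustration rather than a definitive usage of the standard.

```typescript
// Illustrative sketch only: sharing an instrument context between desktop apps.
// Assumes an FDC3-enabled container provides the global `fdc3` object.

interface InstrumentContext {
  type: 'fdc3.instrument';          // standard FDC3 context type name
  id: { ticker?: string };
}

const fdc3 = (window as any).fdc3;  // injected by the desktop container

const context: InstrumentContext = {
  type: 'fdc3.instrument',
  id: { ticker: 'VOD' },
};

// Broadcast the instrument so other apps on the same channel (a chart,
// a blotter, a chat tool) can react to it...
fdc3.broadcast(context);

// ...or ask the container to resolve an app that can handle a named intent.
fdc3.raiseIntent('ViewChart', context);
```

The point is that the chart, the blotter and the chat tool no longer need bespoke point-to-point integrations; they all pass the same financial objects.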
So, drawing to the end of my talk: hopefully I've convinced a few people here to join the FINOS community and give us a shot. Whether you are within financial services or outside of financial services, please come to FINOS and get involved. And there are multiple ways in which you can do that. You can come and evaluate our software — just take a look at GitHub and evaluate it locally. You can consume our software from GitHub and use it; it's Apache 2.0 licensed, so you're fully entitled to do that. You can join a project call, or take part in any discussion through GitHub issues — you're absolutely more than welcome to do that. Then you can start contributing code back into our repositories and become a contributing member of a project team. And then, as you grow within FINOS and within that FINOS project, maybe you would like to lead one — I know we've got some lead maintainers coming up later within the mini summit that you'll hear from. So that can be your goal: to become a FINOS lead maintainer. And there's a link to help you — go to finos.org and look for Get Involved to learn how to get involved. That brings me to the end of my talk, so feel free to come and find the community: we're on GitHub, we're on Twitter, we're on LinkedIn, and you can go to finos.org. With that, I would like to hand over to Rob Underwood, Chief Development Officer of FINOS, for our next section. Thank you very much.

Great, James, thank you so much. I appreciate everyone being here — what a great start to the day, at least for those of us on the East Coast; I know for folks in Europe it's midday, so I appreciate everyone being here. I think we have a really great panel coming up of four good friends now, who have been part of the journey, which we'll touch on a little bit in terms of the sharing of Legend and some of the work that we've done with ISDA and with REGnosys as well. So, without further ado, I'd like to introduce our panel. We have four folks: two from Goldman Sachs, one from REGnosys and one from ISDA. So maybe we could start — Ffion, could you give a quick introduction of who you are and what you do, and maybe contrast a little bit the context that you work in with the area that Pierre works in?

Yeah, absolutely. So I've been at Goldman Sachs for ten years. The first six of those I actually spent supporting our derivatives businesses in their trading, and the most recent four years I've been within our data management, looking at how we build data models for our financial data and the governance around those as well. So that's the distinction between myself and Pierre: I'm looking more at the business side, and at how we bring that business and engineering connectivity as we look to bring data models into the industry.

Great. And Pierre, would you like to introduce yourself?

Hi, I'm Pierre de Belen. I manage the data model engineering team in Goldman Sachs within data engineering. Compared to Ffion I'm more of an engineer — I'm actually coding day to day and I manage a big team that works on tools like Legend, but also on tools like data governance. I still code day to day and genuinely enjoy programming. We're working in partnership with the business and with FINOS to make sure that we can achieve the bank's goals.

Great, thank you. And Nigel from REGnosys, would you like to say hello and maybe tell us a little bit about yourself?

Yes, thank you, Rob. So, as Rob says, I work at REGnosys, where I'm the senior data modeler, with my primary responsibility being to partner with ISDA on the development of the CDM.
Prior to REGnosys I was at Goldman Sachs, where I spent 20 years and worked with Ffion, and I set up the operations data modeling function before departing and moving across to REGnosys.

Fantastic. And Ian, would you like to introduce yourself?

Thank you. Yes, good afternoon, good morning, everyone. I'm Ian Sloyan, Director for Market Infrastructure and Technology at ISDA. Prior to that I was the EMEA lead for data and reporting, dealing with a lot of implementations and regulations. I've never worked at Goldman Sachs, but I have worked at other banks on technology projects and regulatory projects, always in the area of derivatives and how we represent derivatives data.

I have seen you at 200 West Street, so I know you've been to Goldman Sachs before. I think I've been in a few of the offices.

So, for those of you who don't know — and I know we've alluded to it a little bit, and James mentioned it before — legend.finos.org. We'll talk about it throughout the conversation today, but it is a modeling language, a modeling workbench and a suite of tools that Pierre led the development of within Goldman, and that's now been open sourced into FINOS. All of the people on the panel had a hand in that work: the pilot that we just completed and the open sourcing effort that just concluded. A big part of Legend is data modeling, and I know Pierre will clarify some of the use of that term too, but let's jump right into it and get into some of the motivations for the open sourcing of Legend, and also some of the larger industry issues. I'm going to open this up to all four of you. What are some of the current issues with regard to data in the financial industry? Why is there this focus, this discussion, on data and data models and common data structures? For the functional programmers in the crowd maybe this looks like type definitions, and for the object-oriented programmers maybe these look like classes, but whatever your bent is, why is there all this focus on data in the financial industry? Ian, maybe you want to start, then Pierre you can add a little bit, and we'll turn it over to Ffion and Nigel from there.

Yeah, I'm going to take a very high-level approach, because I'm from the trade association and far from any engineering background I ever had. So, ISDA is the International Swaps and Derivatives Association; we come primarily from backgrounds in derivatives markets. Not everyone thinks of trading interest rate swaps when they open up their stock broker, but it's a huge market with huge numbers — I think Tosha called out some of the statistics, but the interest rate swaps market is around 500 trillion US dollars of notional outstanding at any time. Those interest rate swaps tend to have a maturity profile which means the contracts might be in force for many years. Perhaps 40% are short dated, around one year, but beyond that there can be anything out to 25 years, the reason being that these derivatives are underpinning commercial contracts, but also retail mortgages and risk management for pension funds and things like that. So that gives you an idea of the markets we're talking about. And obviously, with that, you have to manage the data for those markets — you have to manage the payments that have to happen every quarter, every month.
There are many obvious things that come to mind, but there are other things that maybe the audience isn't aware of, although you can probably guess from the terminology: there's the exchange of margin on the back of these contracts, they have to be cleared, and there are other risk management, finance and accounting processes that have to happen. And now I'm going to borrow from a colleague at Barclays who has said this in public: for just one interest rate swap at Barclays, they did some analysis and found that it was stored in 15 different places within 24 hours of the trade being done. That was Dr. Lee Braine — to give him the kudos for that analysis. So, 15 different locations to manage a trade through its life cycle. That doesn't seem very efficient to me, but that is what happens today. There are high values at stake, and the chances for inefficiency and mistakes are obviously there.

So how do we mitigate the problem of having the data in all those different places? Well, with costly reconciliations — that's how the problem has been solved in the past. There are different ways of storing the data, different standards being used, or variations on those standards, and the problems are papered over by reconciliations between systems, which don't really add any value to the market. To give you an idea of the cost to these businesses: with the CDM project, which we set out on a couple of years ago to produce standards for the processes that happen to a trade through its life cycle — and here I'm quoting from a Deloitte paper on the future of post-trade, which I'd invite people to take a look at, and I believe these are very conservative numbers — we believe that through the CDM and its implementation in technology we could see savings of $1.5 billion from the automation of affirmation, confirmation and processing. Those processes are pretty much as they sound: very basic data management processes, maybe presenting the data in a different way for a legal confirmation. It's crazy how manual they are, I would say, and having been close to many of these processes in the past, it was always a frustration that we couldn't automate more. In very lay terms, it's just a case of saying: is this your trade data in the same standard form, do you agree to it as a client, and then can we provide it to the regulator?

Just to touch briefly on some of the reasons for this state of affairs: at ISDA we wrote a letter to the Financial Stability Board, IOSCO — the International Organization of Securities Commissions — and the Basel Committee on Banking Supervision, together with other trade associations, setting out our commitment to moving to a digital future for financial markets. The opening of it really calls out some of the reasons why, particularly for the derivatives industry. I believe that the G20 financial regulatory reforms, which were introduced in the wake of the financial crisis about 10 years ago, fundamentally altered the traditional operating structure of bilateral financial markets.
Firms have implemented these regulations and their associated requirements on top of existing infrastructure, placing significant demands on it. So I know the reasons why we're in this state — regulations — and in recent years firms have had to implement these things, often against the clock and under a regulator's watch. But even setting that aside, there's been a lack of golden sources of product data, with some exceptions that actually prove why golden sources are important. I think that's the future we're aiming at and where the potential lies. I'll pause there, because that was quite a long way into the first answer, and maybe Pierre will take a more detailed view on how he sees the problem.

Yeah, well, my thinking is not that far from that. As Ian explained, a lot of parties are involved in processing financial data, and they can be found across the industry but also within a company — it's not because you work within one organization that you're not facing exactly the same problem of having different steps of processing, different hops, and different kinds of environments the information flows through. So why is it like that? I don't think it's artificial. I think it's because processing financial information requires a lot of special skills, special systems and steps that are carried out by many actors. This graph of information is not artificial; we have to deal with it, and we have to manage the difficulties that come with understanding where the information is going, how it's flowing, and how to smooth those connections.

The first thing I would say we have to do with this graph is maintain it and make it available and transparent — to all the actors of the firm, but also to regulators, auditors, or anyone who wants to understand better what is happening on our side. There's definitely an operational aspect to it: if we understand how the information is flowing, it means we can better manage our breaks, better understand the impact of something failing in our organization, and be more in control of ensuring the information is sourced the right way, which is really important for regulators — data lineage and data analytics, ensuring that if we produce a number, that number comes from the appropriate source. It's really hard to do that if we cannot maintain the graph of information within the organization and outside it.

On top of this graph, what we also see is that we have to maintain fairly complex data contracts that enforce data quality across parties. Why? Again, because you have many, many actors that process information, and all these actors have a different view of the data they're processing and different knowledge they want to apply, so it's entirely possible that someone downstream who processes the data will want to enforce data quality constraints upstream — at the point where the information is booked, where the information is actually entered into the system. When you have to manage this complex graph of information, you have to be able to let everyone contribute their view of data quality to the system, so that we can improve how our information flows.
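As a purely illustrative aside, a "data contract" like the one Pierre describes can be as simple as a set of named constraints that downstream consumers publish and upstream systems run at booking time. The record shape and rule names below are hypothetical; this is not how Legend or the CDM actually express constraints, just a minimal sketch of the idea.

```typescript
// Illustrative sketch: a tiny data contract that downstream consumers can run
// against records booked upstream. Shapes and rule names are invented.

interface TradeRecord {
  tradeId: string;
  notional: number;          // in units of the notional currency
  notionalCurrency: string;  // ISO 4217 code, e.g. "USD"
  effectiveDate: string;     // ISO 8601 date
  maturityDate: string;      // ISO 8601 date
}

type Constraint = { name: string; check: (t: TradeRecord) => boolean };

const constraints: Constraint[] = [
  { name: 'positiveNotional', check: t => t.notional > 0 },
  { name: 'knownCurrency',    check: t => /^[A-Z]{3}$/.test(t.notionalCurrency) },
  { name: 'maturityAfterEffective',
    check: t => new Date(t.maturityDate) > new Date(t.effectiveDate) },
];

// Returns the names of every constraint a record violates, so a break can be
// routed back to the originating source rather than patched downstream.
function validate(trade: TradeRecord): string[] {
  return constraints.filter(c => !c.check(trade)).map(c => c.name);
}
```

In practice each consuming team would contribute its own rules to the shared set, which is exactly the "everyone contributes their view of data quality" point.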
The other part that comes out of this graph being complex is that we have to manage data privacy and data sensitivity. When we flow information, again within a firm, we need to make sure that the right person can see the right set of information, and only the set of information they are entitled to see. It's not possible, for example, for a banker to have information about trading. It's not possible for some desks to see the positions of another. And it's obviously not possible for people to access private information about our clients or employees. All of these concerns come out of this fragmentation and flow of information.

One last thing that I think is important, and different in the financial industry, is that we have to milestone our information. It is super important for us to be able to reproduce the state of the information at a point in time, and to be able to reproduce a computation — in the context of an audit, for example. If someone comes and says, you produced this number six months ago, can you explain why, we have to rebuild the state of the data out of milestoned information and actually provide it to people. So for me those are really the biggest data problems that we have to solve day to day in our job.
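Pierre's milestoning point maps to what engineers often call bitemporal or append-only storage: rather than updating records in place, you keep every version together with the period during which it was the system's view of the world, so you can answer "what did we believe on this date?". A minimal illustrative sketch follows — invented field names, not Legend's actual milestoning implementation.

```typescript
// Illustrative sketch: append-only, milestoned records queried "as of" a past
// instant, e.g. to reproduce a number for an audit six months later.

interface Milestoned<T> {
  value: T;
  recordedFrom: string;        // ISO 8601 instant this version became current
  recordedTo: string | null;   // null while this version is still current
}

// Return the version that was current at the given instant. String comparison
// is safe here because all timestamps share the same format and zone (UTC).
function asOf<T>(history: Milestoned<T>[], instant: string): T | undefined {
  const hit = history.find(v =>
    v.recordedFrom <= instant && (v.recordedTo === null || instant < v.recordedTo));
  return hit?.value;
}

// Example: a notional was amended, but the old view can still be reproduced.
const notionalHistory: Milestoned<number>[] = [
  { value: 10_000_000, recordedFrom: '2020-01-10T09:00:00Z', recordedTo: '2020-03-02T17:30:00Z' },
  { value: 12_000_000, recordedFrom: '2020-03-02T17:30:00Z', recordedTo: null },
];

asOf(notionalHistory, '2020-02-15T00:00:00Z'); // => 10_000_000
asOf(notionalHistory, '2020-06-01T00:00:00Z'); // => 12_000_000
```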
And Ffion, what are your perspectives on this? Maybe talk a little bit about what it means to have models and structures that are common not only within Goldman — some of these issues Ian touched on around having representations of, say, an FX option in multiple places — but what does it mean to have common data models in the industry, and what are some of your perspectives on these challenges from your seat within the Goldman organization?

Yeah, I think the additional thing I'd add to what Ian and Pierre mentioned is that in the current world, data is commonly owned within engineering — that's how it's developed across the tech industry, and also in the finance industry more recently. The role of strategizing and developing a data strategy is predominantly handled within engineering. But in finance it's actually the business that creates that data, and we have perhaps lacked communication, in some respects, between the business that is creating this data and the engineers who are looking to store and process it. What we're starting to see, and hopefully will continue to see, is communication between those two sides to get to a much better state in the future, so I think that's one of the current issues we're working through. Then, on the need for standardized data models, I think it's about communication across the industry. As Ian mentioned, regulators are asking for more; we've seen more transparency needed within derivatives markets, for example through clearing. All of these are services provided by many different vendors out there, and communicating with them at the moment means a one-to-one mapping with each of them. If we can standardize that message and that communication, we can quickly reduce those operating costs Ian mentioned for the business.

Yeah, Nigel, maybe you jump in a little bit with the perspective from REGnosys, and also some of the perspective in terms of dealing with some of the regulatory concerns as well.

Well, I would definitely echo what Ffion said there. From the introduction here, maybe there's a misconception that there isn't digitization — in fact there are huge amounts of digitization in finance and massive amounts of data, and it's the lack of a standard data model which limits systems from being interoperable. And I think it's important, as Ffion said, to be able to take data and transform it through the trade life cycle. As it moves from execution through risk management and clearing and into trade reporting, having that common standard that allows the enrichment and transformation of the message whilst maintaining the underlying data integrity is critical — particularly when we think about reporting, which in many instances comes later in the trade life cycle. If you lose that integrity as it goes through the transformations, by the time you start reporting to the regulator you have issues with your data, and that translates into having difficult conversations with regulators because you're reporting incorrect information.

Something that I think is implicit in everything you've touched on, but that I'd like to drill down into a little more, is whether there's something different or special about the finance industry and the data problem within it. The tech industry has been facing the challenge of data and data models going all the way back to the ERP days, but is there something specific about the environment the finance industry finds itself in — banking, capital markets and so on — that presents some unique challenges that need to be dealt with? Pierre?

This I think everyone has touched on: the importance of metadata, of tracking information about models, and metadata around data is really paramount and critical to us. And why is that? Because when you look at what the tech industry is working on, it's dealing with enormous, incredible volumes of data for individual data sets. If you look at the problem we're facing, I would say for each individual data set our volume is pretty much medium. However, we have a lot of data sets, they have a lot of depth, and they are interconnected in a really complex way. It's not a high-volume, simple-data-structure problem; it's a medium-volume problem with extremely complex, deep data structures woven all over the place. Because of that, the day-to-day problem we solve is not really "how can I run this super complex calculation massively on an incredibly large data set" — the tech industry does that super well, with so many frameworks and so many good tools. Our problem really is more about this graph of information.
Where is the information I'm looking for? It's actually fairly complex sometimes even to know where to look, and when you find it, the question is: is this actually the right place, the right originating source, for the information I'm requesting? Then people will ask: am I authorized to see it? Systems obviously control that, but someone looking for information needs to acquire the rights to get it, which is very tricky across all these data sets and all these actors holding different sets of information. Then: okay, now I'm entitled — how can I query it, how can I safely get the information? And once we get it: can I store this information, can I redistribute it? Because, as we saw before, we have privacy and sensitivity constraints on manipulating our data, but we also have licensing constraints with the vendors that provide information to us. And when you look at what Nigel was touching on about regulators: when we get this information and we see an error in it, how can we make sure it's corrected at the right place, and make sure it doesn't just keep coming back? So you see, the kind of problems we deal with seem to be a little bit different from the kind of problems the tech industry is solving. To be able to maintain all this metadata around the information we process, the emphasis is really put on modeling — modeling and meta-modeling, so that we can also manage our models better, and as much as possible leveraging data standards that are available outside, so that when we communicate inside we can easily communicate outside. Really, coming from tech myself, that's what I feel is the big difference, and that's why models come up all the time in discussions in finance.

So, we've touched a little on some of the complexity, but Nigel and Ian, are there any other sorts of impacts or types of issues that emerge from all of this complexity?

Yeah, sure. So, developing on what Pierre said and reiterating some of the numbers that Ian gave and that Tosha referenced right at the start: when you think about the notional of interest rate swaps traded, say, on a weekly basis — three to four trillion dollars, which is a very big number — that translates into about 30,000 trades. So from a messaging point of view, 30,000 trades isn't a huge number of messages to be sending around on a weekly basis. Around 60% of those are traded electronically, and 90% of those are cleared at a clearing house. So there's a lot of data being sent around, probably — no, definitely — in a variety of data formats, depending on what platform the trade is executed on, where it's being cleared, and so on. And if data inaccuracies are generated, then there's a huge risk of operational error, caused either by incorrect risk management of a position because the economic details are wrong, or by incorrect payments.
On three trillion dollars of notional, a decimal point in the wrong place on a price, or an incorrect settlement date, makes a big difference. And you regularly see in the news issues that certain counterparties experience due to some of these data inaccuracies. The second thing I'd just talk about is the pace of innovation in the financial services industry, and this is where the development and design of the data model is important. Any standard that is built needs to be sophisticated enough to capture the detail that Pierre spoke about; simple enough to be understandable by its users, so that they can interpret the messages; and flexible enough to deal with the composability of financial products, which in many instances are tailored specifically to the individual transaction — and all that without the model needing to be continually redefined, but with the tooling available (which sort of segues into what we're going to talk about) and the governance in place to extend the data model as required. The common domain model, we think, has these characteristics, and it also provides a common translation layer between the many other standards that already exist, to allow platforms to interoperate.

Ian, did you want to chime in with something? Well, yeah, maybe just to hark back to one of the points I made at the start. Another element is that there might be relationship documents you need to reference which are long dated. You might have set up a trading relationship ten years ago — was that master agreement stored in digital form, or is it a proxy for it held in some other static data system? All those things are challenging, and it goes to lineage as well: where did this decision to make this payment come from, what's it based on? So that's the only thing I'd add; it's another dimension to the problem. But I definitely echo the idea that it's the complexity of the data rather than the scale and volume.

Great. So let's shift gears for a little bit. We've got about half an hour left, and I want to make sure to leave some time for questions. We're here at a summit about open source, and probably a lot of folks are interested in where open source fits into all of this that we've discussed so far. So maybe starting with Ffion: the four of us, amongst a group of about 20 or 25 other people, just went through this effort of a pilot using Legend with several other investment banks and other institutions. Now Legend is open sourced, and part of the open sourcing of the Legend code is the fact that we have models, built in the Legend language, that are themselves open sourced. So maybe, Ffion, starting with you: talk a little bit about why open sourcing data models — not just the code for the underlying workbench and language, but having the models themselves built in, or able to be distributed as, open source code — why is that a good thing, and what are some of the benefits we're all expecting and should see within the industry from that?

Yeah, thanks, Rob. So hopefully, from what we've already discussed, you can tell the need for data models, and they definitely do exist in the financial industry already — commonly, probably, more so within individual institutions, but with the CDM, as we've mentioned, being one of those standards externally trying to standardize the industry as a whole.
And so, as you said, Rob — why open source? I guess that standardization piece is the really important bit; that's the piece that's going to bring us our reduction in costs and our improvement in trade processing. In terms of open source I think of this in two ways: one is the visibility of and access to these models and the benefits that brings, and the other is the development of these models.

From a development perspective, the open source principles of releasing early and often should bring us more speed, agility and flexibility to improve these models more quickly. We should be able to add to and adjust them when new use cases come up within the financial industry. These could be external factors — we've mentioned regulation already: as new regulations come in, more data might need to be added, and that speed and agility to meet those demands matters — and equally internal factors: from the business perspective they're always looking at diversifying, across the banks and with clients, so how do we meet those demands of new business? That agility and flexibility to add to models quickly is really important there too. The other open source principles I look at are inclusiveness, community and collaboration, and to me all of that boils down to diversity of thought. How do we tap into a lot more of the diversity that's out there, not just within the financial industry but also within the tech industry and other industries working with data? I think there's a lot for us to learn within the financial industry about how tech companies are developing and thinking about data — I'd say they're probably one step, if not a few steps, ahead of us in that space — and I think we'd be silly not to tap into some of that as well. On the industry side, historically, working with industry bodies such as ISDA, lots of the larger banks and institutions in the market have definitely been engaged in this discussion and this idea of standardizing and building data models, but perhaps the clients, the vendors and the service providers haven't been able to get so in touch with those conversations. So through open source, through that availability and by expanding that community, hopefully we'll get more diversity of thought from across the financial industry, and, as I mentioned, from outside the industry as well. That's how I see open source helping with the development of these models.

When I talk about visibility, I guess that comes down to the principle of transparency. These models are now open, they're available, everyone can access them, and that should hopefully help us speed up the adoption of these models as well. And really it's at the adoption level that we gain the benefit — it's all well and good having these data models available, but the value comes when they're being used. The two are connected: adoption is going to come from having better standards, from having more heads at the table when we're building those standards, and hopefully that will speed up adoption too.
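Rob's earlier aside — that to a functional programmer a shared data model looks like type definitions, and to an object-oriented programmer like classes — is worth making concrete here. The sketch below is purely illustrative: invented, heavily simplified types, not the actual CDM or Legend model definitions.

```typescript
// Illustrative only: what "a data model as code" can look like to an engineer.
// A toy sketch of an interest rate swap, not a real industry model.

type Currency = 'USD' | 'EUR' | 'GBP';   // simplified for the example

interface Party {
  lei: string;       // Legal Entity Identifier
  name: string;
}

interface InterestRateLeg {
  payer: Party;
  notional: number;
  currency: Currency;
  rate:
    | { kind: 'fixed'; ratePct: number }
    | { kind: 'floating'; index: string; spreadBps: number };
  paymentFrequencyMonths: number;
}

interface InterestRateSwap {
  tradeId: string;
  tradeDate: string;        // ISO 8601 dates
  effectiveDate: string;
  maturityDate: string;
  legs: [InterestRateLeg, InterestRateLeg];
}
```

Once counterparties, clearing houses, vendors and regulators read and write the same shapes, the "stored in 15 places and reconciled" pattern Ian described starts to collapse into shared, versioned definitions.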
And finally, for me, it's governance. I think we've alluded to that already through this discussion, but this financial data is going out to regulators, it's going on legal confirmations, it's determining how much margin we require and where we're settling, so governance over the data is vitally important in the financial industry. In terms of open source, the SDLC of code brings us governance in that respect — how we're committing code, who's committing it and when it's being committed — and industry bodies such as ISDA have to be there to actually govern what's going into these data models and how they're going to be used, and to help disseminate that across the industry.

Great. And Ian, or Pierre — any other examples in particular, any ways in which you see open source models being used already in the industry? Nigel as well — I'm not sure who wants to jump in first.

So, I think Nigel mentioned that there is digital data; we have standards from the past, such as FpML, used for messaging. And the CDM project at its heart is intended to be open source — that's what we've been doing. I think Nigel has a few examples of things in particular which were presented as recent examples of using and implementing it, and that's what REGnosys and we are working on with the members.

Yeah, sure. Thanks, Ian. So REGnosys has been working with ISDA for the last three years on the development of the common domain model on the Rosetta platform, the DSL of which is open source. Echoing what Ffion said, the expansion of the community that's helping develop the model is very gratefully received, and the more people that are developing the standard and contributing to it, the greater the penetration will be, which is obviously a virtuous circle. There have been a number of announcements in the press over the course of the year. Axoni, who are a distributed ledger technology platform, are building a solution for the equity swaps market which uses the common domain model. There have also been recent announcements around Digital Asset working with ISDA on a clearing pilot using their DAML language, with the CDM as the model underpinning it. And those are just two examples — there are a number of other firms we're working with bilaterally on initiatives to adopt the CDM, as well as larger industry platforms that should be using the CDM, and solutions coming to market for the same.

Yeah, I would add to that. The CDM is definitely a great effort to start to get a common language across the industry, but I would like to insist on something: it's not only about open sourcing data models. What we want to do is also apply the tools and the SDLC leveraged by open source projects to make it easier for different companies to collaborate, exchange ideas and agree on how to interoperate. What we want is to have a new model version created quickly to address the needs of specific companies.
Because what we see too often is that a company will have a specific need, won't want to go to the standards body and say "okay, can we incorporate this" — maybe they're in a time crunch, or they don't want to make the investment — and so they fork and diverge from the language. What is really important is to smooth that process and learn from what the open source community has been doing for some time: gathering a lot of developers — or, in this case, a lot of business actors and a lot of technologists — to refine the specification and get a release of a version in a really accelerated way. Our feeling is that we want to accelerate the process so that we diverge as little as possible from the common language agreed across the industry. So it's not only about open sourcing the model; we also really want the platform to be open source, so that we can invite as many people as possible around the table, so that the community can agree more broadly, and so that we can achieve this agreement and not diverge. That is really why we felt we wanted to open source our platform: to make this acceleration and this kind of community-driven modification of the model real. I'd just like to jump in and back up what Pierre just said: indeed, those are the limitations of some of the standards and data standards we have today. Even our own standard, FpML — the development of FpML has seen a lot of people create their own internal fork, their own variation of it, and it's not a standard if it's not the same everywhere. And in the development of some other messaging standards — think of the lifecycle to get a change into SWIFT for FX transactions — the timelines to deliver things are just often too slow, and people then make bad decisions around forking and creating their own variations. So that's definitely why it's of interest to ISDA to see new standards delivered and developed in this way. So I want to open this up to questions in a second — if folks are interested in asking questions of any of the folks on this distinguished panel, please use the Q&A feature; I see that James is providing some guidance on that in the chat, so definitely ask questions so we can start getting into those — we want to make this interactive. So, you've touched on FpML, and there's also been some discussion of the case for open source, and hopefully we have a sympathetic audience today. Where do you see this going? There's the Rosetta DSL, there's the Legend language and the Legend platform that's been open sourced. I think we're all feeling a groundswell of support for open sourcing — clearly Goldman's announcement last week was a big shot in the arm, and the Morphir project, which we're going to hear about in 20 minutes, is another example. So where do you see this going? Hopefully in 12 months we're not all having to wear these things everywhere, and we're all back seeing each other and able to head to the pub — but besides that, looking at 12 to 24 months.
So, where do you see this going in terms of open sourcing around data models, and also the adoption of some of the CI/CD practices that James alluded to in his presentation a little earlier and that Pierre touched on? What's the future for open source in this area around data models? Maybe, Fee, if you want to give some perspective first on some of the things you saw coming out of the pilot, and then others can add their perspective as well. Sure. Out of the pilots we wanted to make sure we were contributing to something meaningful for the industry, and I think that's really where the future of this has to go as we look to improve standards more broadly. That input of thought that I spoke about before is going to be a big part of that, but we've also got the opportunity to expand a lot more broadly now. We've discussed quite a lot today this idea of what we would do with transactional data — the specific trade data, the lifecycle of a trade — but there are lots of other areas of data within the financial industry as well, whether that's reference data on the actual underlying products we're trading, or market data in terms of the pricing that's available. How do we get all of these things to communicate, and look at the entire lifecycle of a trade from inception all the way through? Some of that's transactional, as I mentioned, and there are a few other data sets as well, so that expansion is going to be really important. I'd also consider new use cases and some of the nuances. Often in the financial industry we start with the bigger volumes first, and as we get through those we're going to end up with some of the more complex, exotic pieces. They're going to start to raise questions about what we've got already and how we're going to improve it, whether that's adding to or adapting current attributes and current models. So that's in terms of improving standards and pushing the industry in that sense. I mentioned before this idea of cross-industry learning: I think there's a lot for us to learn about the acceleration of data-driven transformation across multiple industries, not just in finance, and I'm definitely excited to see where that goes in the future. And that pushes into one final thing for me, which is the innovation that's going to be available. There have been little pockets discussed around the financial industry in terms of new technologies — being able to branch into blockchain, smart contracts — but as we open source and move further, we're going to be able to gain from what's happening in the technology industry and hopefully find better solutions than we can even think of today for some of the problems we have.
Yeah, I would like to add to that. As Fee said, and as Ian and even Nigel said, what we want to do is accelerate the standard, accelerate the CDM, accelerate the different areas that actually power the business from a standardization and modelling perspective. We felt that to do that we had to open source the platform, so what we would also like to do is gather more of a community around the development of the platform itself — if people are interested, the site is at the end of the talk. And to make sure contribution is open and that people don't see the project as biased towards, or only helping, one specific company, we open sourced it under FINOS: we gave the code to the FINOS Foundation, which now hosts our code and the process of development of our code. They also host our platform, so that different banks can work together in a totally neutral environment that isn't specific to any one bank — it is part of this organization, which is great, because it allows us to work together in a better, safer and easier way. So what I really would like to see in the future is acceleration of the model, but also acceleration of, and contribution to, the platform itself, so that we can again include the business users and the technologists in building these models, which are hopefully going to be really transformative in the industry and make things faster, better and safer in terms of operating on financial information. Great. Yeah, go ahead. Yeah, just to add, I'm equally excited about the diversity of views that could come along — we'll see real innovation, and that's very exciting to ISDA. ISDA's mission is safe, efficient markets, but also trying to make access to those markets a lot easier, and how we do that is by making things open. ISDA's legal standards — the Master Agreement was essentially an open source document, published in the 80s — made the interest rate swap market, which was a bilateral market between a couple of banks across the Atlantic, go from a few hundred million to the hundreds of trillions we quoted earlier. This is the same thing: we give people access to the same technology, the same software, that banks can use, and we can only see good things happen. Yeah, the one thing I would add is that an area of particular focus for ISDA and REGnosys is the development of digital regulatory reporting. You've touched on that through this conversation, and we think that by partnering with regulators to create machine-readable, machine-executable reporting we can drastically reduce the cost of compliance. The CDM is a core component of that, because it provides the common data model to represent the transactions — it was recognized in the recent G20 TechSprint competition. That's also a good link to the FINOS RegTech work stream, which is focused on this space, so there's crossover between multiple FINOS work streams, and we're looking forward to working together on that topic with Rob and others as the agenda progresses. Nigel, you nailed those talking points — it was great. Just something on that, on the digital regulatory reporting we just talked about.
I was speaking to a particular large global entity in financial services — a large regulatory entity, you might say — and they talked about changing the culture of how we deliver regulation. I thought that was a great phrase: to change that culture, to have these open models which we can all work on together, and also to have that conversation with regulators and demonstrate what we mean. When we say "your regulation doesn't work" or "you need to improve the law here because it's unclear to the market", we can demonstrate that with data now, through models and through platforms like Legend and the CDM. Fantastic. So I think we have our first question. Adam Jones asks — and maybe I'll pose this first to Pierre — how can someone, primarily from a technical background, get involved in Legend to start leveraging it and contributing to the use cases? I'm also going to concurrently send out some information that people can see, but Pierre, how can folks start to get involved in Legend, start looking at the models, and maybe even open some pull requests and things like that? Yeah, so there's a lot of depth and a lot of things to do there. First, you can look at the code on GitHub, in the FINOS GitHub organization. The project is structured in many modules and there are things to do for all kinds of developers. The first one is Studio, which is our visual modeling environment, where people can work in JavaScript to improve the way people enter models, manage models and enter code. There's the Pure language itself, which is a big language where people who really like languages can work on it, modify it and tweak it. And there's the full platform with the SDLC management, where we can add more SDLC plug-ins — right now our SDLC backend is GitLab, but it's totally possible to add a new backend and integrate with different kinds of SDLC. That's what we have today; going forward we're going to open source more code leveraging the Pure language for transformation. When we interface with different data sources, like relational databases, and want to map our model to that kind of infrastructure, we actually write a lot of transformation code that goes model to model — for example, how to transform a Pure function into a SQL query, so that we can translate a request coming from a business user who is not a technologist into a technology request that will leverage an infrastructure. That's a space where my team and I spend a lot of time: how can we leverage models and model-to-model transformation to be able to access information? If people want to contribute from a leveraging perspective, you can start to use the platform to define models that you can transform into different kinds of technology — for example Protobuf or JSON Schema for serialization purposes. Going forward you'll be able to use these models, map them to infrastructure and provide a query tool for your users, and later we'll also allow people to translate these queries and the metadata gathered to acquire information, and deploy them in production into a services environment. That's the kind of thing we operate day to day on our side and that we're going to be contributing to the project. Great. Great.
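To make the model-to-model transformation idea a little more concrete, here is a minimal sketch — written in Scala rather than Legend's Pure language, with entirely hypothetical types and names — of describing a model as data and generating a SQL query from a typed request over it:

```scala
// Hypothetical sketch (not Legend's actual API): a tiny model described as data,
// plus a "generator" that turns a simple typed query over it into SQL text.
object ModelToSqlSketch {

  // A minimal model definition: an entity with named, typed fields.
  final case class Field(name: String, tpe: String)
  final case class Entity(name: String, fields: List[Field])

  // A toy query over the model: select some fields, with an optional filter.
  final case class Query(entity: Entity, select: List[String], where: Option[String])

  // Model-to-model transformation: translate the query into a SQL string,
  // assuming one table per entity and one column per field.
  def toSql(q: Query): String = {
    val cols   = q.select.mkString(", ")
    val filter = q.where.map(w => s" WHERE $w").getOrElse("")
    s"SELECT $cols FROM ${q.entity.name}$filter"
  }

  def main(args: Array[String]): Unit = {
    val trade = Entity("Trade", List(
      Field("id", "String"), Field("notional", "Decimal"), Field("currency", "String")))
    val q = Query(trade, List("id", "notional"), Some("currency = 'USD'"))
    println(toSql(q)) // SELECT id, notional FROM Trade WHERE currency = 'USD'
  }
}
```

The real platform of course handles far richer mappings and targets (relational stores, Protobuf, JSON Schema), but the shape — a model captured as data, plus generators that turn a business-level request into an executable artifact — is the idea Pierre describes above.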
And if folks have any more questions, don't hesitate to ask them — I've also posted some follow-ups with links in response to Adam's question. One question that we've been getting, and which probably bears a little bit of context and explanation (it's in the README), is that the Legend project uses both GitHub and GitLab, and sometimes people ask where each fits in. GitLab provides the SDLC for the modeling efforts: when you're in Legend Studio working on the models themselves, under the hood there's a GitLab instance managing the models as, effectively, open source commits, and that's abstracted through Studio. And, as Pierre alluded to a second ago, the actual underlying code — the builds of Studio and SDLC and all the components of Legend itself — happens through GitHub. What's really interesting — I know Pierre and I have talked about it a bunch of times — and one of the things that really excites me, is the fact that the builds are happening on FINOS infrastructure, on the public GitHub repos. This is authentic open sourcing; it isn't just a case of Goldman having put the code out there while keeping its own fork behind the firewall where all the nightly goodness of builds and so on happens. The CI/CD is genuinely happening outside the firewall, on the public FINOS GitHub infrastructure, and the models are hosted and controlled through the public GitLab infrastructure. So it's pretty cool. I would continue on that a little bit: there's a difference between where we operate the code of our platform, which is on GitHub, and the models, which are on GitLab, exactly. It's really important to understand that we need an SDLC to operate on models, because the point of the effort that we're running within our firm, and hopefully outside our firm, is to make sure that we can drive a lot of the information flow through metadata, so that the business can be involved, but also so that this metadata can drive actual production flows — and there's no way we can execute a production flow without following a proper SDLC. In the same way, there's no way we'll be able to accelerate the collaboration and the work with other entities if we don't follow the tools that people use for everyday engineering and SDLC, so that we can make a lot of changes much faster, but in a super safe and open way — so that people understand who is making a modification, at what point the information was reviewed, and how, for example, that information will then actually be published to a standard like the CDM. The notion of SDLC is really key there, because what we've found, looking at the way people usually manage dictionaries, is that they are managed by business people but diverge enormously from what's happening in real life in systems. We have this beautiful set of information that people like to communicate about and write PowerPoints on, but when it comes to what's happening in real life, you have a totally different picture.
So making a language that is easy for a business user to use, that lets the business user — and other parties too — actually enter the specification, but also making all of that executable, audited and manageable, means we can execute on it and get a "what you see is what you get" effect between what people communicate and what is actually happening in real life, in systems and in data flows. Yeah, so following up on that, there's another question in the Q&A: with regards to the SDLC and the open models, what steps are taken to test the models themselves — how are the models tested? So there are many strategies to test models. We're not using a logic-based approach to models; what we're doing is more like a type system that is closer to object-oriented, with a functional language applied to it, so testing a model is an interesting thing. The first thing we ask people to do is add constraints and validations, to specify really clearly what the intent is in the context of the model. Then the tests come in when you map this model to different data sources, and you want to ensure that the questions people are asking are actually accurately answered. So what we let people do is enter models, enter mappings, and next to the mapping write a set of tests — the queries they would ask — with a set of sample data that would actually retrieve information. There's a lot of depth here around generating test data to actually exercise models; we have strategies that source information from our systems and sanitize it to make it available, so that we ensure the test data is not exposing private or sensitive information. But I would say the first thing is to guarantee your model is consistent, to refine it with constraints, and to guarantee that these constraints in tests are violated when they should be violated; the second is to make sure that when we map these models to different infrastructure and people ask questions, they get back what they want. To give you a feel for the number of tests we operate right now for the models we have on our premises: we run around 12 to 13,000 tests for each commit, because we're in a completely continuous development process — that's the kind of volume we have. We have time for maybe one more question, so if anyone would like to ask a question please do so now. In the meantime — Ian, Fee and Nigel — maybe just a quick closing thought or two as we bring this to a close: any final thoughts on our glorious future of open sourcing data within financial services? Not really — I think we've said it all, and we're very excited to see the shift. I can certainly attest to the fact that these stuffy old financial institutions have really changed their minds; they're starting to see the benefits of open source and a lot of the new technologies in the last couple of years, in a way we've been really happy to see. Nigel or Fee? For me, it's the excitement of where this is going to go next, and if anything it's probably a call to everybody to get involved — whether you're in finance or not, there are ways of doing that, whether through FINOS or other industry bodies. Start contributing to this and we'll start to see where that innovation takes us.
Yeah, echoing that: with the acceleration we've seen over the last few months through the FINOS pilot work streams, we've had a huge amount of development in the CDM in a very short space of time, so it's really exciting to see, and hopefully that will continue. I will attest to the lack of stuffiness, because I actually put on a collared shirt today for the first time in a year, but I know full well that if I were to wear said collared shirt to the 23rd floor of 200 West, I would be overdressed. Pierre is laughing because he knows it's true. So anyway, I'd like to thank everyone on the panel — this was fantastic, really appreciate it. If you have any questions about Legend, don't hesitate to ask at legend-inquiries@finos.org. There's lots of information out there; definitely dig in as well on the ISDA CDM and some of the good stuff happening there, and the work that the REGnosys team is doing on Rosetta, which is the DSL that the CDM is represented in. So lots of good stuff happening — thank you to everybody who's here. With that I'd like to turn to our next panel, which I'm also really excited by. The Morphir contribution to FINOS was another large contribution this year, from Morgan Stanley, and there are actually some really interesting relationships between the Legend contribution from Goldman and some of what's possible in Morphir from Morgan Stanley. So without any further ado I will turn this over to Steven Goldbaum from Morgan Stanley. Hey everyone. All right — thank you, Tasha, I think. Hi everyone, thanks so much for joining us today. We're very excited to be here and really thrilled for the opportunity to introduce you to Open RegTech and the way the financial services industry is leveraging open source to comply with complex regulatory requirements. I'm honored to be part of this panel today along with Steven Goldbaum, executive director at Morgan Stanley, and Mark Marron, the designer of the Bosque programming language and a principal research software development engineer at Microsoft. I'm a strategic initiatives manager with FINOS, the fintech open source foundation, which is part of the Linux Foundation. As you may have heard in the opening remarks earlier today, we are a member-based organization; our members include some of the largest investment banks as well as tech firms, fintech firms and vendors, and our goal is to enable open source in financial services. We have a community of hundreds of developers and subject matter experts from financial institutions, tech firms and vendors collaborating on our open source projects and leveraging the efficiency, the cost savings and all of the other benefits of open source. We host over 40 active and incubating projects, all of which live on the FINOS GitHub organization, and some of which you will learn more about today. Earlier this year, we launched our Open RegTech initiative in response to two complementary realities. On one side, by definition, our projects focus on technology solutions to common challenges and use cases in financial services — and, importantly, solutions that do not provide a competitive advantage. That is to say, our projects target challenges for which many organizations would each normally need to build a similar solution, and the idea behind our open source projects is to bring all of those actors together and get them to collaborate on building one solution.
And to stop reinventing the wheel and wasting time and resources developing the nth iteration of a very similar solution. At the same time, while we were promoting open source in financial services, we were missing the perspective of a key industry participant: financial regulators. And not only were we as a community missing them, they were also missing out on all of the opportunities and benefits of open source collaboration. So we launched the Open RegTech initiative within FINOS with a twofold objective. First, to streamline and mutualize regulatory compliance efforts at banks through collaboration and open source projects. And on the regulator side, the goal is to get them familiar and comfortable with open source, address the concerns they may have, and also show them the potential of open source for regulatory supervision and for issuing regulation. Our very ambitious vision for Open RegTech is a model in which regulators adopt open source code and best practices and define regulations in open source code form, while retaining full visibility for their supervisory activities and monitoring — and where, on the other end, industry participants collaborate to codify and standardize those regulations and to mutualize the interpretation of regulatory requirements. We know this is a very ambitious and long-term goal, but we're very excited about the interest and engagement we've seen from the community so far, and we're working hard to see this vision become a reality. Where do we stand now? When we launched the initiative earlier this year, we were expecting we'd need to do a lot more work to educate and convince regulators about the benefits of open source, and also to debunk the myths around open source security and vulnerabilities. But we have been very pleasantly surprised by the response from the regulatory community, both in the US and in Europe, who are a lot more familiar and comfortable with open source than we initially thought. I also think it's safe to say they're already on to the next phase: they're eager to identify use cases and projects to start collaborating on. Some regulators in the US, like the CFPB, are really mature open source consumers and contributors, and even have an open-source-first policy, so they do everything in open source. We are in ongoing conversations with US and European regulators to discuss potential collaboration opportunities and even some potential open source contributions by regulators, which is really exciting. We have also recently had AIR, the Alliance for Innovative Regulation, join FINOS as an associate member; AIR is a US nonprofit focused on promoting innovation in financial regulation. We will also have representatives from the CFPB and former regulators join our upcoming Open Source Strategy Forum on November 12 and 13. And just last week our board approved the creation of a Regulation Innovation special interest group, led by ING and the Alliance for Innovative Regulation, which will provide a venue for regulators and financial institutions to discuss and identify specific challenges that they'd like to tackle through open source collaboration. Special interest groups — just to clarify, because it may not be a very common term — are meant to work on the problem space, the definition of the problem or challenge.
And then once that's been identified, it would be up to a project to develop a solution through open source software and standards. On the other branch of our twofold effort — identifying projects for financial institutions to collaborate on — we've seen fantastic progress as well. One project we're particularly excited about is Morphir, which was recently contributed to the foundation by FINOS member Morgan Stanley, and which has tremendous potential to drive open source collaboration on regulatory reporting and compliance. Morphir is a multi-language system built on a data format that captures an application's domain model and business logic in a way that is agnostic to the technology. Morphir allows you to have business knowledge available as data and to process it programmatically — translating, visualizing, sharing and storing it. And, most relevant to our RegTech initiative, the team at Morgan Stanley has already used Morphir to calculate liquidity coverage ratios as mandated by federal regulations in the US, essentially by turning regulation into code. I won't take any more time talking about Morphir, because we have the honor of having Steven Goldbaum on the panel today, the co-creator of Morphir and an executive director at Morgan Stanley. Steven, the floor is yours. Thank you, Aitana. So, providing regulations as code — why would we want to do this? Well, first of all, regulations are incredibly complex, so turning them into running systems is an expensive process, and having every firm do that means that expense is duplicated across the industry. It just makes good business sense to say: maybe we can share the effort, we can share the burden, and by doing so everybody wins — and in the process, maybe we get even more accuracy. But that brings a good deal of challenges on the technical side. For one thing, we're talking about sharing complex logic. While the industry is fairly good at sharing data, and at figuring out how to use data across different technologies, sharing logic is a much bigger and more difficult problem. And we have to make sure we're not excluding any firm: there's a vast range of different technologies across the industry, so we can't really say we're going to standardize on one technology and everybody has to use it — and if you don't already use that technology, you'd have to do an expensive rewrite. That's just not a viable approach for something that needs to be used industry-wide. Even more importantly, we don't want to get in the way of the evolution of technology. We all know technology is evolving very quickly, and we wouldn't want regulation-as-code to stand in the way of firms adopting newer and better technologies. And maybe the biggest challenge of all is that the whole thing has to be incredibly accurate: one small mistake can mean financial impact across a number of firms, so we have to make sure this is correct, that there's no ambiguity, and that there are as many guarantees of correctness as we can possibly get. So given all these technical challenges, how might we approach this? Well, we're going to look at a technology called Morphir that was recently contributed by Morgan Stanley to FINOS. Morphir is all about defining a standard for translating business logic across technologies.
What Morphir does is provide a data structure called an intermediate representation (IR) that we can encode business logic in, plus a set of tools to process that business logic into different contexts. Some of those processes might provide real-time documentation that is always up to date, or even interactive documentation that users can use to audit results when they wonder how a number was arrived at. Similarly, we might want to take that business logic and translate it into other programming languages so that it can run in different execution contexts. Or we might take that idea even further: if our firm has a standard set of platforms, we could actually generate entire applications from the business logic that conform to those platforms and regulations. And when we think about coding regulations, that's exactly what we want: here's the business logic, coded up, and here are the tools — either the ones that exist or ones customized for our firm and our firm's standards. You might ask, well, how do we code the regulations in the first place? Morphir doesn't dictate any particular way of doing that. We do offer support right now for the Elm programming language. We do that because Elm is a very simple language; it's a pure language, meaning there are no side effects — the developer has to think about business logic and business logic only — and it provides a lot of guarantees in terms of the correctness of the models and making sure we don't have errors. And as we mentioned, it's very important that we don't have errors, because that could have serious financial impact. To get an idea of what this looks like in practice, we're going to look at something called the liquidity coverage ratio (LCR). I'm scrolling through this just to show that right now it's provided as hundreds or thousands of pages of very dense prose — imagine taking this and trying to turn it into a computer system when we're looking at something like page 61,000 or thereabouts. That would be a daunting task for anybody, and obviously if it were provided as code we wouldn't have to do it. The interesting thing is that if you look at the documentation, it actually looks a lot like code in a lot of places, so you can see how it could naturally translate — maybe this document could actually be done in code. We're looking right now at the definition of a counterparty: a counterparty is limited to this list of values, and we know in programming that that translates to something like an enum — a union type in this case — and you can see how the code naturally flows from the documentation. Similarly, this table in the LCR documentation is a specification of how you figure out which maturity bucket something lies in. It's presented in an arcane format with a lot of explanation, but what it turns out to be is very simple code, and we can see that in our model. Looking more at rules and categorization: there are a lot of rules for how we classify assets and cash flows, and those rules, again, are dense prose — but when we look at the actual model, we see that it's actually pretty simple programming-language constructs.
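To give a flavour of the constructs just described, here is a rough sketch in Scala (the actual Morphir model is written in Elm, and the counterparty categories and bucket boundaries below are invented for the example rather than taken from the rule text):

```scala
// Illustrative only: the counterparty list and the bucket cut-offs are made up
// to show the shape of the code, not copied from the actual LCR rule.
object LcrModelSketch {

  // "A counterparty is limited to this list of values" -> a union type.
  sealed trait Counterparty
  case object Retail      extends Counterparty
  case object Bank        extends Counterparty
  case object Sovereign   extends Counterparty
  case object CentralBank extends Counterparty

  // The maturity-bucket table from the rule text becomes a small total function.
  sealed trait MaturityBucket
  case object Overnight  extends MaturityBucket
  case object UpTo30Days extends MaturityBucket
  case object Over30Days extends MaturityBucket

  def maturityBucket(daysToMaturity: Int): MaturityBucket =
    if (daysToMaturity <= 1) Overnight
    else if (daysToMaturity <= 30) UpTo30Days
    else Over30Days
}
```

The point is simply that "limited to this list of values" becomes an exhaustively checked union type, and the maturity table becomes a small function that both the compiler and any reader can inspect.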
That's actually something we want: we want to be able to look at this program and understand it, for many reasons — most of all, maybe, because if many people can look at it and understand it, then we can catch errors and bugs and make sure it's correct. And finally, if you go look up the LCR, probably the first thing you'll find is this equation: the LCR is really a mathematical equation — the high quality liquid assets divided by the total net cash outflows. The LCR provides this math-oriented calculation document as a supplement, and we can use it to create the same code; you can see that it's very, very similar. In fact, if you go to this document and to the LCR example in the Morphir examples repository, you'll see that it's basically one-to-one between the document and the code — which means this idea of providing regulations as code is really achievable. At this point you might say: well, that's great, you've got the regulation in code, but it's in Elm and we don't run Elm — so what are we going to do with this? That's where Morphir comes in. With Morphir we have a set of tools, and one of those tools parses Elm into the Morphir IR; as you recall, the Morphir IR is that common data format that all the business logic is saved in. You can see here some of the concepts we saw in the model before — level 1 assets, level 2 assets — in the Morphir IR. And now that we're in the Morphir IR, we can generate from that IR into other languages or other platforms. Here we're looking at the Scala code generated from what was essentially Elm, into Morphir, into Scala. Some of the other things we generate are documentation; we can support other languages; we've had generators for SQL before — so there's a range of things we can do with this. That's where Morphir helps the industry: by providing a set of tools that are already common, like the Scala generator, and then providing tools that different firms can use to customize the output to match their own platforms. And again, that's exactly what we're after: we don't want to make firms rewrite their platforms. We would rather provide tools that take the regulation and allow them to adapt it onto their existing systems, rather than say "you've got to rewrite everything to make this work". I'm going to finish up with correctness. As we mentioned, the regulation has to be correct, and in this case the LCR provides a number of examples to ensure that everybody is understanding it correctly. Examples are great because we all know we can turn them into unit tests, and in our example that's exactly what we've done: Elm has a unit testing capability, so we can use it to unit test our regulation-as-code. Now, unit tests are great, and they do catch a lot of bugs, but when it comes to regulations we really want something better than that. As I mentioned before, even a small error in the regulation could have pretty significant financial impact, so we want to do whatever we can to make sure those regulations are correct. Elm is great in that it catches a lot of errors that a general-purpose programming language wouldn't catch.
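Before moving on, here is a sketch of the headline calculation and the example-derived tests just described — again in Scala rather than Elm, and with invented figures rather than the rule's own worked examples:

```scala
// Sketch only: the real rule defines HQLA and net outflows through many
// categorization and weighting steps; the numbers below are illustrative.
object LcrCalcSketch {

  final case class BalanceSheetInputs(hqla: BigDecimal, totalNetCashOutflows: BigDecimal)

  // LCR = high quality liquid assets / total net cash outflows
  def liquidityCoverageRatio(in: BalanceSheetInputs): BigDecimal =
    in.hqla / in.totalNetCashOutflows

  def main(args: Array[String]): Unit = {
    // A worked example from the rule text would become a unit test like this one.
    val example = BalanceSheetInputs(hqla = BigDecimal(125), totalNetCashOutflows = BigDecimal(100))
    assert(liquidityCoverageRatio(example) == BigDecimal("1.25"))
    println("example LCR = " + liquidityCoverageRatio(example))
  }
}
```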
Elm catches things like forgetting to handle a certain case of an enum: it won't compile, and you'll have to correct it, so you can guarantee a certain quality level. The common saying is that if it compiles, it's guaranteed to run without runtime errors, and that's true of Elm. It turns out that we can actually do better — and doing better is just going to be that much better for the industry. So next we're going to introduce Mark Marron from Microsoft Research to show us how we can use the Bosque language to reduce risk and increase confidence. Oh, great — thank you very much. As mentioned, my name is Mark Marron, I'm from Microsoft Research, and I'm going to be talking about the Bosque programming language and how it can be used to build high-assurance model code in a framework like Morphir. To start off with: the Bosque programming language project is a language I've been working on here at Microsoft Research, and it's focused on exploring the boundaries of what is possible in building high-assurance software using techniques like formal verification, model checking and automated fuzz testing. We're taking a slightly different approach than usual: instead of taking an existing language and attempting to retrofit an analysis on top of it, we're looking at what makes building one of these analyses difficult and what makes them work really well, and we're building a programming language specifically to get as much as we can out of these analyses — taking away the things that make them very difficult to scale or make precise. In the interest of the fact that this is an open source summit, I also want to mention that all of this work is happening in public, on GitHub, open source licensed, so it is freely available to everybody who's interested — both the core language itself, and the collaboration we're starting up with Morphir with the support of FINOS. As I mentioned, this is sort of a reverse of the usual design process: usually you build a high-level language, then you build an intermediate representation, and then you build tools. We're going the other way around: with Bosque, we want to target some tools, and the first thing we're doing is building an intermediate representation that makes those tools as powerful as possible. A lot of the time, seemingly simple choices in the intermediate representation — whether it's bytecode based versus register based, how it exposes iteration constructs — can have a huge impact both on the difficulty of building these kinds of verification or analysis tools and on the final results they're able to produce. So we're building this IR specifically to make these tools as powerful as possible, by excluding things like unbounded looping constructs or a lot of indeterminate behavior. Once we had that, the next question was how we can build, on top of this IR, the friendly, high-productivity programming experience that developers expect from a modern platform — familiar language constructs that allow rapid iteration and let developers feel right at home, while still targeting this IR that supports great analysis tools.
We also want to make it easy for developers to feed useful information to these analysis tools, by letting them specify lots of validation constraints and logic directly in source code without having to learn a separate modeling language or set of specification tools. Our hope was that this would create a virtuous cycle: as developers add a little bit of extra constraint, the powerful tools we have make those constraints very useful and encourage developers to add more, and these constraints and additional specifications actually make the program easier for the developers themselves to reason about at a high level. So we get this nice confluence of information becoming more and more useful as we go. So, without any more high-level overview, let's talk about what the Bosque language looks like. This is a simple program: a function called abs that takes one variable and returns an Int. As you can see, it's got a nice familiar block structure; I can assign a variable multiple times in the flow; I can have conditional if statements; and it just figures out the right sign and multiplies by that to get the absolute value. So it looks very familiar to most devs coming from a TypeScript or Python or Go type of programming language. We also lean heavily on things you'd see coming from the functional space, so we have higher-order functions — here, an "allOf" takes a predicate that checks if something is greater than zero — and we support rest and spread arguments, like you'd be familiar with from JavaScript. So there are a lot of things that make it very easy to plumb things together quickly without writing a lot of boilerplate code. We also support objects and what you might consider mixin classes or traits, as well as union types, so it's very easy to adjust your type system to model the particular domain you're in without having to build a very complex and touchy ontology. And then there are nice things like switch-as-expression, null coalescing, early returns with errors — all the stuff you would expect coming from a modern web language. So let's go into a demo and show what this looks like in terms of some of the modeling code we discussed earlier, and how the tooling experience works. For this demo I want to zoom in on a small portion of the modeling code we were looking at in Morphir, written in Elm earlier, and look at how a portion of it might look if you wrote it in Bosque. In particular, this chunk takes some inflows as a list and some outflows as a list, performs some computation on them and returns the result. And from the spec we actually know a high-level constraint: whatever the details of this computation are, the return value should always be greater than or equal to zero. Now, what our tooling stack allows us to do, given the nuances of the Bosque language, is take this code as written and translate it into a logical model in first-order logic that can be consumed by an SMT (satisfiability modulo theories) solver. I can show the code here — as you can see it's a little hairy, there's a lot of low-level detail because of the exact logical encoding, but it captures, in full precision, every possible execution of this code. With that we can run our solver, expand this, and search for any possible input that would cause that assertion to fail. And as you can see, it found one, where the first list contains the element zero.
And the second list contains the element negative one. If the first list contains the element zero, this sum will be zero; if the second list contains the element negative one, this sum will be negative one; the result will be negative one; and sure enough, our ensures clause will be violated. So it found a counterexample that we can use to debug our model code, just as if we had found a failing test case. Now we can fix this by flooring the result at zero, which is the correct fix for this bug. And if we go back and run our checker again — this time it won't generate a model, because now the code is correct — it will expand everything and actually find a conflict set proving that it is impossible for any inputs to this computation to ever violate the output clause. This is really exciting, because it allows us to add high-level constraints, or verification conditions that we always want to hold, to the code, and then effectively check them and get more confidence and assurance that our computations satisfy the desired high-level semantics. Now let's jump back to the slides and I'll show a little more about what types of things you can express in these assertions, and what features the language has for making it easy to get these concepts into the right place in your code. So now that we've seen how the Bosque language and tool chain allow you to insert conditions and assertions to be checked — and either fully verify them, or symbolically generate a counterexample that shows a bug in your model — let's look a little more at the range of things you can specify. We wanted to make sure it was very easy for developers to add these kinds of higher-level semantic assertions and restrictions on what should be happening in the program, or in their model, so we introduced a bunch of constructs to make it easy to insert them without requiring a bunch of custom code. We also wanted to avoid the common problem of having a separate modeling or proof language: the language you write the assertions in is exactly the Bosque programming language, and anything you can write in Bosque code can be included in an assertion and checked at runtime, checked by the verifier, checked by the model checker, and checked by the tester. So this is all there and can be taken advantage of by all of the tooling, and we were very happy to make it completely transparent. There are several places you can add these assertions. A common one is pre- and post-conditions, as you saw in our example, and with all of these you can choose when they're enabled: always, if you want the condition to hold at runtime and throw an error if it fails, even in release; only in your debug builds; or only for use by the verifier — you set the flags that enable all of this. We also have data invariants, which are very powerful. Here we have an entity, or class, called Order with three fields — name, quantity and price — and the invariant guarantees that any time one of these is constructed, the name will always be non-empty and the price will always be greater than or equal to zero.
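A rough Scala analogue of that Order invariant (runtime-checked here via require; in Bosque the same conditions sit on the type and are also visible to the symbolic checker):

```scala
// Scala sketch, not Bosque syntax: the invariant is checked at construction time,
// so an invalid Order can never be observed by the rest of the program.
final case class Order(name: String, quantity: Int, price: BigDecimal) {
  require(name.nonEmpty, "name must be non-empty")
  require(price >= BigDecimal(0), "price must be >= 0")
  // In Bosque, quantity >= 0 would be carried by the Nat type rather than
  // restated as an invariant; here we simply check it the same way.
  require(quantity >= 0, "quantity must be >= 0")
}

object OrderSketch {
  def main(args: Array[String]): Unit = {
    val ok = Order("ACME", 10, BigDecimal("99.50")) // fine
    println(ok)
    // Order("", 1, BigDecimal(1)) would throw IllegalArgumentException here.
  }
}
```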
That allows you to count on this invariant any time you're using one of these types, and it prevents you from accidentally creating an invalid object of this type. Interestingly, you'll notice quantity is not included in this invariant, even though we would assume quantity always has to be greater than or equal to zero. We wanted to make it as easy as possible to have a lot of common conditions expressed in the type system as well — we'll talk more about this later — but this is one example, where quantity is declared as a natural number. That actually guarantees it's greater than or equal to zero, and we don't have to wait until the verifier stage: we can check and catch errors around this in the type system. Finally, we support ad hoc checks and asserts that you can put anywhere in your code, for any condition you want verified or validated. Here a slightly more complicated bit of code is stuck in there, where we're checking whether the name exists in a collection of existing orders before doing something. I want to emphasize, again, that the assertion language is arbitrary Bosque code, not just simple inequalities or ordering constraints. As I mentioned, we have more examples of these numeric types, which are very common things you want to validate. These are also known as unit types, and we've borrowed them from other languages. What they allow you to do is say: I have a specific numeric type, in this case decimal, and I want to make derived types called USDollars and Percent, and I want these to be distinct types with distinct operations, and I want the type checker to be aware of that. So, for example, if I have a variable "payment" defined to be of type USDollars and I attempt to assign an arbitrary number to it, that's a type error; if I create a literal US dollar amount and assign that, it's perfectly acceptable. Through operator overloading we also allow you to overload the built-in operators, and define your own, so you can do type-safe operations on these values. Take multiplication: it doesn't make sense to multiply two US dollar amounts together — that's a type error — but if you want to take 25% of an existing payment, that's a totally type-safe computation that will succeed and produce the expected type of USDollars. This is a really powerful feature that lets us easily create custom numeric types that behave in an absolutely type-safe way — we can do verification on them, we can use the type checker on them — and it also makes your code much more understandable and auditable from a human perspective. Going further with this, we've introduced a concept we've been developing in Bosque called typed strings, and we support two flavors. One of them is basically regex-validated strings: for example, I can declare a type called ZipCode, which takes five digits with an optional trailing four. I can also create a type that has a parse operation with any custom logic in it. Using these, I can declare a variable of type "string of ZipCode", and if I attempt to assign it an arbitrary string like "ok", that's a type error, because it's a plain string and I wanted a typed string of ZipCode. I can then try to declare "ok" as a ZipCode string literal, but that will fail because it doesn't match the regex.
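A rough Scala analogue of those unit types and regex-validated strings — here the checks are ordinary wrapper types and runtime validation, whereas Bosque enforces the equivalents statically through its type checker:

```scala
// Scala sketch, not Bosque syntax: USD and Percent are distinct wrapper types
// so they cannot be mixed up, and ZipCode is validated when it is constructed.
object TypedValuesSketch {

  final case class USD(amount: BigDecimal) {
    // Taking a percentage of a dollar amount is allowed and yields USD...
    def *(p: Percent): USD = USD(amount * p.value / BigDecimal(100))
    // ...but there is deliberately no USD * USD operation: that would not compile.
  }
  final case class Percent(value: BigDecimal)

  final case class ZipCode(value: String) {
    // Five digits with an optional trailing four, checked at construction time.
    require(value.matches("""\d{5}(-\d{4})?"""), s"'$value' is not a valid zip code")
  }

  def main(args: Array[String]): Unit = {
    val payment = USD(BigDecimal(200)) * Percent(BigDecimal(25))
    println(payment)               // USD(50)
    println(ZipCode("98052-1234")) // fine
    // ZipCode("ok") would throw here at runtime; Bosque rejects the equivalent
    // literal at type-check time.
  }
}
```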
In fact, of those assignments, the only one that is type safe and allowed by the compiler is the one where the zip code string literal actually is a validly formatted zip code. This allows us to add a lot of rich semantic information to interfaces that today would be typed as "it takes a string, a string and a string": now it's a string that is a username, a string that is a password, a string that is a requested resource, and we can expose that to the type checker, making the code more understandable to a human, but also lifting this information so it's available for the verifier to do a much better job with. With these parsable strings we can also do typed-string literal construction, and it also gives us a way to do string-literal objects: the parser allows you to take, say, a path literal and will construct a Path object for you without you having to explicitly call a constructor. So it makes it very transparent which things are literal object constructions, without you having to manually expand the constructor yourself. There's a lot more we're doing in Bosque, both on the programming-language-feature side and on the tooling side, but I just wanted to give you a flavor of what's going on there, and why we're excited and think this is really a new paradigm in the way programming languages are built and software is developed. I want to close on this code design principle: the desire to make code transparent and easy to reason about, both for automated purposes and for humans, with things like these unit types and typed strings, and the way it builds a virtuous cycle — the tools that can leverage these high-level, intentional semantic assertions add more and more value, and feed that information back to the developer. We're also really excited to be working with Morphir and FINOS to build on the polyglot capabilities that are in Morphir: they don't restrict us to having to build a full runtime ourselves, and we can benefit a lot from the framework that Morphir has already built. And we think it's really exciting to work in this FINOS space, where these high-assurance models are very valuable and give us a chance to show the potential of this leading-edge tooling and these analysis scenarios. I think that concludes what I wanted to discuss, so I'll pass it back for the next section. Thank you. Thanks so much, Steven and Mark — this was great; I really appreciate you taking the time to be with us today. And thanks everyone for joining us to hear this panel. I'd like to echo something that Steven said before we go: regulations are complex, turning them into code is expensive, and, more importantly, sharing the effort means that everybody wins. So come join us, check out Morphir and our other projects at github.com/finos, and get involved. Thank you so much, and have a nice day. I'm going to go — I'm double muted, don't you love it. We do have Steven and Mark on, who can answer any questions from the attendees if there's anything in particular you'd like to ask — and Steven and Mark, do you want to say hello and put your video on so everybody can see you? Hello. So I do have a first question, while we give everybody else a chance to come up with theirs. Steven — we just heard from Legend
just before this, and obviously Rob introduced it and identified that there were some synergies — can you talk about any conversations you've had about how the two different systems might be able to interact in the future? Yeah, I think one of the things we're really looking at is that Legend is a great platform for modeling data and data interaction and transformation, and it's got a fair amount of logic as well in terms of calculations and transformations. Morphir is very focused on the logic modeling aspect, and for logic modeling you need data modeling as well — it's certainly dependent on having good data models. So if you take the two together, I think you've got a very powerful potential platform, and that's what we've been looking at. I think that's just a great example of two innovations from companies that would normally not work on innovations together: we can actually look at these together through the FINOS program, and it just makes a lot of sense. Great, thank you. I have another one for you: can you talk a little bit about how Morphir is used within Morgan Stanley — clearly there's the potential around RegTech — and maybe some of your decisions around choosing to open source code that was built in house? Yeah. Some of the things we focus on are increasing development efficiency. The idea is that we as application developers like to think our jobs are converting business problems into computer problems, and then we realize there's all this other stuff we need to do in terms of regulatory requirements, working with frameworks, and complex things like concurrency and so on. So some of the things we focus on are being able to holistically capture calculations and rules and then quickly turn them into running systems through code generation, and I think that's where the opportunity here lies. When you look at something like Bosque, with all those advanced capabilities, we can choose a language based on its capabilities and what we want it to do, have those languages target the Morphir intermediate representation, and then use that to generate running systems. From a business perspective, we tend to look at it in terms of regulations and rules — categorization and things like that — and being able to look at a system holistically, which is often difficult when you have logic peppered throughout the system. Absolutely. And Mark, you're in the research arm of Microsoft: if people wanted to get involved with the Bosque language, how do they go about doing that — you've provided links before — and what are the areas of future development, maybe building on what you touched on earlier? Yeah, so the GitHub page is where we've been doing all this work, and we've tried to keep it out in the open from the beginning.
You know, right now we're still more in the experimental phase; we've been doing a lot of prototyping and rapid evolution of the language, and this is one reason why it's been great to work with everybody on the Morphir team: to understand where our good ideas actually address the problems that they had, and where some of our clever ideas maybe weren't actually that useful in practice. So having their input and that real-world experience has been great in helping us with some of these experiments. We welcome anybody who's interested in understanding more from the theoretical side, or who wants to just try some of these things out; we've had some people open pull requests, so it's great to have some community contributions to the code. At this point we've hopefully reached a bit of a steady state where we've got a lot of the features that actually turn out to be useful, we've nailed down most of the things that we need to do for the verification tooling, and we're really trying to step it up and get to the point where we can integrate effectively with Morphir and start providing some of the value that we'd like to the overall project. And this is where it's great that they've set up this collaboration GitHub repo for us with FINOS, to start doing some of this integration in the not too distant future. Great, thank you. We do have a question that's just come in: what compiler does Bosque use, and can you catch specific regulations at compile time without being explicit? So, can you catch specific regulations: well, let me start at the beginning. The compiler we use, that's been part of the language development process. We convert into this intermediate representation, which today we just compile to C++ for easy interop with everything. One of the great things about Morphir is that we wanted to basically live in this polyglot language environment that dominates modern development; if we're able to plug into their IR, we automatically play nicely with everybody else. As far as things you can check automatically in the compiler and the verifier: pretty much any language-level issue that comes up. We'll verify array out-of-bounds accesses, integer overflows if you want, any of those primitive operations that can go wrong; if you're a JavaScript developer, you know "undefined is not a function", and we can check for anything like that that would pop up. Also, as I said, we really wanted to make it easy for you to add logic that is specific to your problem area, high-level things like sanity checks that values are positive, or that you're not creating invalid object types. If you're dealing with one of these systems that's communicating across RESTful APIs, with databases, with other data stores, with arbitrary web services, data is coming in and it's very easy to have these sorts of parsing and validation errors. So we wanted to make it easy to have this data validation layer right at the front that we can do extensive checking on, and make sure that the rest of your code is going to be protected from the kinds of problems that can pop up there. So I hope that answered the question; if it didn't, I'm happy to go into a little more detail. Thank you. It's not as easy as you'd like to enable attendees to speak, but, oh, that was great. So, thank you.
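As a rough illustration of the "validate at the boundary" idea Mark describes, here is a small Python sketch. It is not Bosque syntax; the names are invented for illustration, and in Bosque these invariants would be declared on the types and discharged by the verifier rather than checked at runtime.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    """A validated price: downstream code can assume these invariants hold."""
    value: float
    currency: str

    def __post_init__(self):
        # The kind of sanity checks mentioned in the talk: positive values,
        # no malformed fields sneaking in from a REST payload or database row.
        if self.value <= 0:
            raise ValueError("price must be positive")
        if len(self.currency) != 3 or not self.currency.isalpha():
            raise ValueError("currency must be a 3-letter code")

def parse_trade(payload: dict) -> Price:
    # The single place where untrusted external data is checked;
    # everything downstream only ever sees a well-formed Price.
    return Price(value=float(payload["price"]), currency=str(payload["ccy"]).upper())

if __name__ == "__main__":
    print(parse_trade({"price": "101.25", "ccy": "usd"}))  # ok
    # parse_trade({"price": "-5", "ccy": "usd"})           # would raise ValueError
```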
Another question: is Bosque supported through Visual Studio Code? That's on our to-do list. We have a basic add-in that gives you syntax highlighting support, but as you saw in the demo, all the tooling runs through the command line right now. We'd like to hook it up to have a very nice integrated environment, so this checking just becomes a routine part of your workflow: it's easy for me to add an assert, it's automatically checked, I get all this great feedback value from it, so I just want to enrich the specifications, and the end product is something I'm very confident in, in its correctness and in that it matches the underlying business logic intent. Great. Here's another one: organizations currently use rules engines to capture business rules. How do Morphir and Bosque compare to those? Who wants to take it first? I'll let you go first, Stephen. Okay, yeah, I'll go first. I think one aspect of rules engines is that they capture individual rules, and less so holistic rules. Rules engines are really geared at giving users a kind of access to creating rules and then running them non-deterministically, which is very different from what you would find in, say, a regulation, where it's very precise; you can't really run a regulation like that, and that's where the developer provides a lot of value. So that's one big aspect, and I think the other big aspect of rules engines is that they tend to dictate the runtime environment: they control the rule creation and then the rule execution. In doing that you're locking yourself into a bit of technology, which, when we're talking about regulations again, we want to avoid; we want to be as open as possible to different runtime contexts. So I think those are two major differences between a rules engine and codifying rules but then generating execution out of them. Yeah, and I want to echo that. With a rules engine, or even codified rules, you have each rule written independently, and then you hope that all of those implementations of the individual rules ensure that the intended specification and behavior of the overall system is correct. The push of Bosque is to allow you to specify, in the same code, some of this high-level intent, and then to be able to use verification tooling to connect those individual rules to those high-level intents and make sure something wasn't lost in the intermediate translation or the development. I'll finish by saying that anything you can do with a rules engine (a rules engine is a nice UI over programming, in the end, I think) you can basically accomplish the same way and then have those rules compiled down to code that's very precise rather than, as Mark mentioned, very isolated. So you can get the same user effect, where users can manage rules without knowing the intricacies of programming languages, if you want to build that kind of UI. Thank you, I hope that answers the question; if not, feel free to pop another one into the Q&A. There's another question, feeding a little bit on some of the earlier conversation: how open are the Morphir and Bosque maintainers, so you guys, amongst others, to accepting pull requests from the wider open source community? And I know, Mark, you've answered this a little bit,
so maybe you want to touch upon what different types of skills, or different experiences, you need in order to be able to contribute valuably. Yeah. The first part is about openness to accepting those pull requests. Yeah, I mean, I think this is a good time to add that on the Morphir side it's community driven. The reason that we open sourced it from Morgan Stanley was that we realized the community could provide more value than a closed source system. I think what we've seen is that when people get the concepts of automation and code generation, it opens up a different level of innovation, and so we're hoping that with the community we get that same level of innovation and contribution. So we're totally open; that was the whole reason we open sourced it. And then the FINOS model itself is very open; it's very easy to contribute, there's just one CLA and then you're enabled to contribute. In terms of what kind of skills: basically anything, from documentation to very complex programming language theory and everything in between. And from the Bosque side, we're academics, so we love to share and learn together, and that's the best way to do it, so we wanted to start open from the beginning. Thus far, most of our contributors have been other academics, and the model has been that they've forked off the project, done some independent research on it, and shared back their publications or their learnings. We've also had a number of people make smaller contributions: documentation, bug fixes, feedback on language features in the GitHub repository. Those have all been phenomenal as well, particularly for identifying some design choices in the language that were not ideal. It's been challenging for the more traditional contributions just because we've been in such a rapid state of churn and change and experimentation, but hopefully, as I said, we're over the big hump on that, and it would be great if people are interested in compilers, type checkers, all that sort of stuff; we'd love that. We also love people who come in and just give some feedback on trying to build a couple-hundred-line or 500-line project, how it worked and what didn't work, and open some bugs. That's awesome as well, so we're very open to everything. Whether you want to do something as advanced as writing a paper as part of your PhD thesis, or "I found a bug and I just wanted to let you know about it", we love all of that. That's great, thank you, and that really does, I think, epitomize our community: that sense of openness and real desire to collaborate, and that's why we open source in the first place, right? And I would just add that within the Open RegTech initiative that I mentioned, which FINOS launched earlier in the year, we've had a number of opportunities to present this to regulators, and we're actually quite pleased and surprised with the openness we've had in response. We think of regulators, and rightly so, as being somewhat closed off and maybe not as advanced in technology as some other members of the industry, but we have been pleasantly surprised at the willingness, and actually the desire, to make changes earlier.
Ian was mentioning also the digital regulatory reporting, and I think that using these open tools and this transparency to define, in much better ways, what those implementations are, instead of wading through those 6,000 pages of text as you showed, Stephen, and to validate the implementations: we are really optimistic that through projects like this, the initiative, and the openness that we're seeing from the regulators to explore how they can leverage this work, we will make some strides there too. Are there any other questions or any closing thoughts that you have? Fair enough. So, thank you, Stephen and Mark, very much, for your time and contribution. And I think James is going to introduce our next panel. James, do you want to. There he is. Stephen, thanks again. Thank you very much. That's amazing; thank you very much to the Morphir team for an insightful discussion this afternoon. So, for those who didn't see my talk earlier, I'm James McLeod, director of community at FINOS, and I have the great pleasure of introducing our next talk this afternoon. I'd like to introduce you to Paul Groves, senior vice president and lead technical architect at Citi, and Andrew Carr, head of consulting for Scott Logic in Bristol. I've had the great pleasure of working with both the DataHub and DataHelix project teams over the last year on creating synergies between the two projects, which I'm pretty sure both Paul and Andrew are going to talk through today. So, Paul, it's over to you. Great, thanks James. Can you hear me okay, just checking the microphone's working? You are good. Fantastic, thank you. Great. So hi, I'm Paul Groves. I work over at Citigroup as an architect in the client onboarding space, and throwing over to Andrew to say hello. Hello, I'm Andrew Carr, head of consultancy for the Bristol Scott Logic office. So, um, let me get this working. Right, so DataHub. That was a contribution that we at Citigroup made as our first contribution to FINOS earlier in the year. Very briefly, DataHub is a synthetic data generator. We support two modes of operating: you can handcraft your own rules, or you can use production data; you can take existing data sets, DataHub can analyze them, produce a statistical model out of them, and then use that for synthetic generation to try and preserve privacy. So it supports those two things. Now, with DataHub, largely what we've done is we've brought together lots of different existing open source libraries and tried to make a very seamless, consistent interface across them. There's lots and lots of tooling out there, but the problem was the amount of time you had to spend to bootstrap these projects, so we decided to give it all a consistent interface, and then we've done a bit of our own work on top of that as well. So that's what DataHub is, and I guess, Andrew, DataHelix? Yeah, so DataHelix I guess came from a slightly different point of view. We tried to build a tool, to start with, mostly to be used by our testers to generate test data really quickly, so the aim was that, hopefully without coding, they could write a config file which described the rules of the data, and then run the tool to generate some data.
So it's in a similar field to DataHub, but actually a different use case, which I'll get on to a bit later. Essentially, what we ended up doing is building a system that defines its own language, which is what you're seeing on the screen right there: a DataHelix profile. And this is a playground that you can find on FINOS; if you modify the data profile on the left and then press run, it will show you some sample generated data. The idea is really to satisfy the use case of generating large volumes of data quickly, for load testing or things like that, where it would otherwise take a decent bit of coding to produce your own data. So, next slide please, Paul. Right, so synthetic data. What is it? There's been a lot of talk about it recently and it's all a bit mysterious, I think. Really simply put, synthetic data is anything you can algorithmically generate, and there are lots and lots of different ways that can be done, and lots of different applications for synthetic data. You can go back to things like the oil and gas industry, the medical industry, government statistics: they all heavily use synthetic data approaches. And then, probably more relevant day to day in finance, we tend to look towards synthetic data for data privacy, so taking production data, or, as Andrew just said, just generating test data from a bunch of algorithms and other inputs. Largely, with synthetic data there are two broad approaches. One is defining your own rules: you understand the domain, you understand the constraints of the application or the data, you essentially handcraft those rules, and you generate a whole load of data. Then there's a second one, where we do data analysis: we use a bunch of statistical analysis. There are lots of different ways that can be done; GANs in particular are a big area of research, and a lot of commercial products are now using GANs to analyze production data, work out distributions, and then use that statistical model to forward-generate data. So it's not a particularly new thing; there's just been an increasing amount of talk about it. A bit of history: if we look at what we think of as synthetic data now, it actually probably started back in the 90s with the US Census Bureau. They had a couple of problems. One is that not all the census data is complete: people might partially complete the forms, and a lot of households wouldn't even respond. So they wanted a way to statistically populate the missing data with what they thought would be good data, which is the kind of thing we often now use machine learning for, and they came up with statistical approaches to do that. The other thing they wanted to do was give out the data sets in a safe way. Redaction had a lot of problems, where you could essentially re-identify people pretty quickly, so they wanted to take their big US Census data and transform it into a safe data set that was publicly shareable, with no risk of re-identification. That was their idea, and there was a whole load of techniques they came up with in the 90s to do that.
And so that's where a lot of the work comes from now. More interestingly, or more fun: if anybody is in their 40s and used to have a BBC Micro or a Spectrum, you'll probably recognize this, Elite. This was the whole procedural generation route. What they were trying to do with Elite: they had very limited resources, tiny amounts of memory, but they wanted to create an entire galaxy of 256 planets, eight galaxies you could travel around, and they wanted to make it seem very, very realistic and really give you this immersive experience. So they went down the approach of actually doing procedural generation, and they had a couple of very small data sets, like a couple of string lookup tables; the actual data they needed would have taken just a couple of bytes of memory. But what they could do was consistently generate this entire galaxy on demand, as it was required, from that very, very small input set, and that would create this little galaxy of planets that would all have their own unique properties. That's really what procedural generation is about. And again, Elite was back in 1984: while lots of things had used procedural generation before then, it's probably the first time we really saw somebody use procedural generation to make this kind of real, lifelike, immersive experience. So that's a little bit of history about where we've come from. So yeah, synthetic data is nothing particularly new, but it has been coming up quite a lot, and there have been a lot of new vendors coming up in this space as well. I think that's for two reasons. One is we've had GDPR and data protection, and I think a lot of us have gone down the same route of trying to do data redaction and anonymization. That's quite an expensive process, and often, I think, the results have been a bit unsatisfactory: particularly with redaction, we've all had problems with data being re-identified, and with anonymization, quite often you end up with just a massive ball of redacted fuzz that has no use in your test systems at all. Also, there's been quite a bit of interest from machine learning and AI. How do you generate realistic events, not for training the model, but for your back-end engineering data pipeline? How do you get extremely realistic data into your AI and ML pipelines so you can do your development with it? Of course, you would still always use the real data to train your model; this is about the bit before that. A good example: say your AI or ML process needs some new data tagging, and that tagging hasn't been done yet, but your modelers need a bunch of data to start working with. While other changes are happening to get that tagging process in place, you can separate that out a little and generate some data up front that you might not actually have access to yet. So yes: redaction, anonymization and synthesis. When it comes to data privacy in your development systems, those are your three approaches. Redaction is removing sensitive information: customer names, social security numbers, all those usual bits. Anonymization is a fuller process where you're not just redacting, but making sure there's no way to re-identify, out of that redacted data set, back to a real person.
And that's actually quite a complicated process. Or we go down to synthesis, and as we said, there are two ways to synthesize: you do it completely procedurally, handcrafted, or you use existing data sets. So when we talk about redaction and re-identification risk, here's a very simple example. You could take a data set, say your HR database, with people's sensitive information in there: religion, salary, disciplinary records, whatever. If you just went through and removed people's names, maybe took a few other fields out, and then put that into your test system, it wouldn't take a genius to pretty quickly work back out who the people were from other information that's internally available, like your address book, particularly because you've still got things like the department they're in and their corporate title. Quite often you might only have one person with a given corporate title in a department, so it's really easy to go and re-identify all that information again. That's a very, very simple example; there have been a lot of case studies into re-identification, and you often only need a couple of attributes to be able to go back and re-identify. And that's where we get into the area called differential privacy. We could probably spend hours talking about this, but the basic idea of differential privacy is that it should be mathematically impossible to take a statistical set of data and work backwards down to the original individuals; that should just be impossible. Differential privacy is quite firmly rooted in cryptography, so we might try to stay away from this subject a bit today, because we'd just go off on a tangent; but if you're doing data analysis to then drive synthetic generation, it's quite a key thing in there. It's very important. So, how are we doing for time? Yeah. If you look at a typical synthetic data flow, you've got two bits: the stuff that you need to do in your production environment, and the stuff that happens in your non-production environment. In your production environment, you need to get your data out: you query your data, you do a basic bit of redaction to make sure the PII attributes are removed, then you do a bunch of data analysis on that and produce some kind of statistical model, your differential privacy file. That's normally a big aggregation and a bunch of statistics about the data it sees: relationships, observed correlations and whatnot. So now you've got that file, and it should be clean of PII; you shouldn't be able to go back from that file and re-identify anybody. It's just a big statistical model of that data. So what we do is take that, and then we can do synthetic processing over the top of it to generate back out all the row-level information. Then we can add back in PII-like attributes: not the real PII attributes, like somebody's real name, but we can generate fake names, fake addresses, and fill in all those bits we redacted out. And then we can write that to a non-production database.
So hopefully what we've got then, in our non-production database, is a whole load of row-level data that looks like the real data but with no way to ever get back to the real people. It feels like the real data, it feels realistic and immersive, it should be decent quality, and it should obey all the rules that were observed in the production data. So that's the route if you've got an existing data set available. Or there's the opposite route, where there's no data available, or you don't want to use it, and you go for straight procedural generation: you handcraft a bunch of rules, you produce a whole load of data, you put it in your data store. It's quite simple, but obviously it's more time consuming. Your developer, the person who's authoring those rules, is going to have to have a much better understanding of the data, its constraints and its rules, and will have to have much more innate knowledge about it to be able to do that. So, cool. I think that's hopefully a very brief view of synthetic data. I guess over to you, Andrew. Cool. Yeah, so I'm going to talk a little bit about the typical use cases for synthetic data and, I guess, how they impact procedural generation. As Paul mentioned, there are a few different approaches; obviously, taking real data and then anonymizing and redacting it has its challenges, and I suspect Paul will chat about that more later. Certainly one of the issues with doing that is you can end up with data that's in no way useful: you redact it and anonymize it so much that it's not actually fit for purpose. And I do think, as Paul and I have found being involved with this synthetic data generation, we've ended up having lots of conversations with people about the use cases they're using synthetic data for, and the more you talk, the more you realize that the right solution for synthetic data really depends upon the use case you're trying to solve. For example, if you're looking for lower-volume, highly accurate data to test your functionality, you're probably going to go for a very different solution than for the other use cases, which are more like high-volume, reasonably shaped data of reasonable accuracy, which you might just be using to do something like a load test; or high-volume data that you want to be statistically accurately shaped, because you're going to be using it for modeling, machine learning, analytics, and so on. Paul, can you go to the next slide please. So, depending on which case you're considering: I'm going to go through a simple case and demonstrate why the rules-based approach is really good if you're going for the kind of load-based testing where you want the data to look reasonable, it doesn't have to be that accurate, but you want to generate the data really quickly. A tool like DataHelix, and there are a number of them out there, where you can just use a config file to modify the rules, can be a good approach for that. But it's not such a good approach when the data needs to be highly accurate and you need it for functional testing, or you need it to have a statistically valid shape; there you're probably going to want to go down a different route. Cool, if you can hit the next slide.
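As a rough illustration of the production-to-non-production flow Paul outlined above: build an aggregate statistical model from the real data, then regenerate row-level data from that model and decorate it with fake PII. This is plain Python with made-up names, for illustration only; it is not the DataHub API.

```python
import json
import random

def analyze(rows):
    """Prod side: keep only aggregate statistics (a crude 'model'), never row-level PII."""
    regions = [r["region"] for r in rows]
    amounts = [r["amount"] for r in rows]
    model = {
        "region_weights": {v: regions.count(v) / len(regions) for v in set(regions)},
        "amount_min": min(amounts),
        "amount_max": max(amounts),
    }
    return json.dumps(model)  # this file is what crosses over into non-prod

def generate(model_json, n, seed=42):
    """Non-prod side: regenerate row-level data from the model, then add fake PII."""
    model = json.loads(model_json)
    rng = random.Random(seed)
    regions, weights = zip(*model["region_weights"].items())
    fake_names = ["Alice Ltd", "Bob Plc", "Carol GmbH"]
    return [
        {
            "name": rng.choice(fake_names),            # synthetic PII added back in
            "region": rng.choices(regions, weights)[0],
            "amount": round(rng.uniform(model["amount_min"], model["amount_max"]), 2),
        }
        for _ in range(n)
    ]

prod_rows = [{"region": "EMEA", "amount": 120.0}, {"region": "APAC", "amount": 80.0}]
print(generate(analyze(prod_rows), n=3))
```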
So let's consider a really simple use case, the kind of low-volume, highly accurate data, and see what we can do. Imagine you define the test data really simply: this is just five columns. It's a massively simplified example, but it really illustrates how complex rules can get, really quickly, even to get data that looks reasonable on first pass. So we have some sort of trade ID; a stock ID, whether that's a RIC code or a CUSIP or an ISIN or whatever; a stock name; some sort of price (and again I'm simplifying, so we're not including currency and all the things you typically do need to include to be useful); and the trade date-time. If we hit the next slide: if we just literally defined that we needed an ID, another ID, a name, a price and a trade date, and we let a generator go off using even some really basic modeling, we would clearly get nonsense data. From a load-testing point of view, I suspect this wouldn't even be that useful, because the stock ID is probably the wrong size and probably doesn't contain the typical characters a stock ID contains, and the price clearly contains the wrong number of decimal places. If you were doing a load test with this, while it might have some use, it would probably be negligible. So it's definitely not usable for functional testing, it's probably not usable for volume testing, and it's definitely not suitable for machine learning or statistical analysis of anything. We go to the next slide, and we get a little bit more specific. We can certainly go for enumerations, for example in the stock ID, and enumerations in the stock name, and we define a little bit more of the range that we want for the price; again, the rules language we use is, I guess, immaterial. And we put a little bit more of a rule around the date-time: say it's greater than a week ago and less than today. Even with this set of rules, if we head to the next slide, we can see that while the data looks a little more sensible on first pass, it doesn't take more than a second to realize that the stock ID and the stock name don't match, which quickly brings you to the conclusion that you really need dependencies between the columns. For any row of data, if one of the columns has a particular value, you probably want to narrow down the possible values of the other columns; and in this particular case we're all aware that the stock ID and the stock name clearly need to match, or we're just generating data that doesn't really make much sense. Now, for volume testing that might be okay, depending on the situation, but possibly not; for functional testing it's clearly not going to be useful, and again it's definitely not going to be useful for machine learning or any form of statistical analysis. So we go to the next slide and try to tighten the rules up a little bit again. We can add dependencies between the columns and say, okay, the stock name has to line up with the stock ID; you could have some sort of mapping, or some sort of enum, where they line up, or however you wish to do it. And we go a bit further and say the price needs to be a float with two decimal places. Then we go to the next slide and try to generate again.
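As a rough sketch of what dependency rules like these might look like in code (hypothetical Python, not the DataHelix profile language): the stock name is derived from the stock ID, the price is bounded and rounded to two decimal places, and the trade time falls within the last week.

```python
import random
from datetime import datetime, timedelta

# Hypothetical rules: enumerated stock IDs, a dependency tying the name to the ID,
# a bounded price with two decimal places, and a trade time within the last week.
STOCKS = {"VOD.L": "Vodafone Group", "BARC.L": "Barclays", "HSBA.L": "HSBC Holdings"}
PRICE_RANGE = (50.0, 150.0)

def generate_trade(trade_id, rng):
    stock_id = rng.choice(list(STOCKS))
    trade_ts = datetime.now() - timedelta(days=rng.randint(0, 6), hours=rng.randint(0, 23))
    return {
        "trade_id": trade_id,
        "stock_id": stock_id,
        "stock_name": STOCKS[stock_id],                # dependent column: must match the ID
        "price": round(rng.uniform(*PRICE_RANGE), 2),  # two decimal places, bounded range
        "trade_ts": trade_ts.isoformat(timespec="seconds"),  # newer than a week ago, older than now
    }

rng = random.Random(7)
for i in range(3):
    print(generate_trade(i, rng))
```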
Okay, so it's starting to look a little more sensible, in that the stock IDs and the stock names now line up, but clearly the price is swinging all over the place. So this might start to get to the point where it's useful for volume testing, where it looks okay enough; but for functional testing and machine learning analysis, it's not enough. And this has been a really simple situation, where we're just getting started on getting the data anywhere close to being useful for volume testing. For a project I was involved with recently, we were looking at a situation where we had a table of trades with 140 columns, and with the rules-based approach you needed about 3,000 rules to get the data to look accurate enough to do volume testing on, and even then it wasn't accurate enough to do functional testing. So I hope I've demonstrated, even in a small way, how quickly these rules can build up; it gets to the point very, very quickly where you need a very large volume of rules to get accurate data. Going forward, if we head to the next slide: if we look at a more complex situation than a trade (we've gone through the really simple situation), consider something like a bank account and generating data for that. You can clearly see you're going to start to need dependencies between the rows as well, and also some sort of state, because you're typically going to expect salary to come in on the same day every month; you're going to expect the same big outgoings every month, like rent, not some variable that swings all over the place; you're going to hope that the outgoings are less than the incomings; and you're going to want the amounts to be realistic, such as people getting coffee at Pret on weekdays, back when we used to go to offices. You also want the number of events to be realistic: you don't want someone going to Pret 300 times on Monday and then once on Tuesday. So what I'm trying to get across here is that the more complex and the more accurate the data has to be, certainly when you're leaning towards functional testing or statistical analysis, the more you need to look either at a different approach that isn't rules-based, or at adding so many rules that it's almost as complex as the business logic of the application itself. Okay, next slide. This is just a summary: if you're looking for volume testing and the data just has to look sensible, a rules-based approach can work. If you want super-accurate data for functionality testing and validation, you're probably going to have to get involved in doing your own coding and development, because the complexity of the rules would need to be as complex as the application, or close to it, to generate the data properly. But if you're looking for data which is the right shape, as Paul mentioned, you can use one of those techniques, machine learning or something else, to look at the data, extract the shape, and then generate data of a very similar shape to what you've seen. Next slide, Paul. This is my slide, actually. Sorry. So, these are examples of some of the things we've been using synthetic data for: particularly, in client onboarding, how do we generate realistic onboarding events?
It's very heavily workflow based, so there's a lot of complexity in those workflows: only certain steps occur with particular client types, and so on. That's one big thing we started this on. Then things like generating risk and PnL with realistic sets of values: that's just big-volume data, and the constraints on the data are relatively simple. We also have things like actually generating portfolios of trades, just creating portfolios for example, and credit card payments between merchants and card holders. And one of the main reasons we've all seen for this is that it makes relationships with vendors and cloud providers much easier, when we're trying to have that exploratory first conversation before the NDAs get signed and all that complicated stuff: if we've got a truly public data set that describes our problem, that's much easier to hand over. So those are some of the examples we've been looking at and using. Moving on to the next slide: on to DataHub itself, where hopefully we'll give a bit of a demo and nothing will break. DataHub is a Python library; that's to say, we're heavily based on things like pandas data frames. What we do with DataHub is just return data frames back to you. What you do with that data frame, if you want to go and populate a database or put it into a CSV file, is kind of outside the scope of DataHub; there are plenty of other Python libraries out there that will handle populating databases with data frames for you. That's why we settled on pandas and didn't really go down the whole data-sinking route. But yeah, you just pip install the DataHub core package, and after a couple of minutes of all the dependencies downloading you should be good to go. We'll kick that off now, because it will take a couple of minutes, and we'll use a system I've just set up. First of all, I'm just going to quickly show you how to bootstrap your project and handcraft a couple of rules to generate, say, some client data. So if I tab out and over to here; let's maximize that. So, what's going on here? We've just imported DataHub here; it's all Python based, and the way we script the rules, again, is all in Python, so it's very much in the Python ecosystem. We call a DataHub generate and we give it some properties: descriptions of the attributes that we want to generate. Here we're just saying: create me a field called region, pick one of these values, use these weightings, and give me a hundred of them; and we've got a fixed seed number here, so use that fixed seed. Every time you generate, you'll get the same data back each time; if we were to remove that completely, it would be completely random each and every time. So we also allow for consistent generation. If we just save that (and comment that out as well, it's going to be useful later) and run it, you can see we basically get a list of 100 regions, and they're just varying; they should all match the weightings we've set. Go back over. Now let's say we want a country field: we've got a client, it's in a region, and now we want to put that client in a particular country, so we can do something called gen dot country codes.
There are in-built generators; regions are actually in-built as well, but that was just to illustrate. So let's generate some country codes. Let me change that to 20, actually, and then we run it. Right, so you can see we've got it here, and now we've got this very ugly object in there, a country object, which doesn't look very nice. So let's go back and add a bit of post-processing to just get the alpha-3 code of that country; that's what that line of code is doing. We go back and run that, and now we've got the country alpha-3 code. Very quickly, you're going to go: this is rubbish. We've got Poland and we've said it's in North America, we've got India in North America; that's clearly not very good. So what we do in DataHub is understand things like regions and countries and currency codes, so what we can do here is give it a constraint: we pass in the region field, and this generation function will understand the constraint that puts on the country. Now, if we run that again, it should look a lot better: now the USA is in North America, Australia is in APAC, Ukraine is in EMEA, and so on. So it's all good. We've got a few more bits and pieces, so let's pick an industry. And so you don't have to watch me type things out, I'm just going to go and steal from another file. So let's go back: we're adding an industry type in there, and a SIC code, for this client we're making up. There's just a bit of post-processing, because these are actually objects and there are lots of different elements in that object, so to keep the rendering nice on my screen let's just add that post-processing back. If we render that now; we need to fix that one, that's the SIC code; sorry, run that again. We've got a little bit more now: we've got a region, the country, and now we've got an industry, so mining, and then a particular element within mining, and this is just using all the public standard industry code information that's out there. So again, in DataHub, we're now going to generate a legal name for that company, and it's going to be specific to the industry and the country it's in, so we give it these extra parameters that say: this is where you can find the industry code for it, and this is the country it's in. So we can name things appropriately depending on the country and the industry. Again, let's go and run that one more time. Here we go. So now we've got region, country, industry, we've got the SIC code, and then we've got a name. Here we've got a company in finance, a mortgage bank or loan correspondent, and we've called it something like "Hold the Loans Group", for example; there's a construction one here with a bit of a horrible name. You can see we try to generate appropriate names for that kind of industry and country. If we go back, let's go and add some numeric fields. There are all sorts of ways we can handcraft how we want these numbers to look; at the moment we're just going to do it very simply, "pick me a random number between these two points", but there are distributions we can use: we can say, look, use a normal distribution with a particular mean and standard deviation. So it will fit all those kinds of distributions. So, assets under management and estimated value; let's just do that example and run it again.
And now you can see we've got the AUM and the EV columns over here, just randomly generated. So that's how we can quickly handcraft rules together and very quickly get a reasonable-looking data set. Now, if we head back over to the slides: the other mode was, well, how do we do analysis of data? I'm not going to run this live, because it can be a little bit slow and you've probably got better things to do with your time than watch an hourglass tick by. What we do here is this very first line you're seeing up here, called analyze run: we give it an input data set, a CSV file, and ask it to output a model file when it's done. We tell it a little bit about the data set, the bits of it we're interested in: essentially, what the discrete values are and what the continuous values are. So we say the discrete values are region, country, SIC code, industry and so on, and the continuous values are our AUM and EV. And there are a bunch of other bits and pieces we do. There are different kinds of analysis modules: there's one called the fast bucketing model, which runs very, very quickly and generates data of, I would say, reasonable quality; we can maybe go into how that works later if people are interested. But there are also things like CTGAN, which we integrate; there's another one called STV; and there are a couple of other models as well, one that uses linear regression, for example. So there are different kinds of models we can plug into this to do the analysis, and they all basically output this model file. What we can then do is write another function which generates the data: we call the generate-from-model function, tell it which model to use (we're going to use this fast bucketing model), and supply this model file JSON. And then, remember, going back to that slide earlier where we talked about adding back in PII-like attributes, things like names and addresses: what we can do here, in this properties area, is add back in, say, a name, and tell it where it will find the particular bits it needs to generate that name in the data set it's generating. Then we can basically redecorate the data with the PII attributes that had been stripped out. If you notice, over in this analyze run we only put in the elements we actually want to do the analysis on, what we want to fit to the distributions. We don't put in things like people's names, their social security numbers, bank account numbers; all of that you add back in afterwards. So, going back to that earlier picture: this model-generation step is what we'd be running against our production data, and this second generate function is what we'd be running in our development environment to actually produce the data and then do something with it. And as I said, you end up with a pandas data frame, so you can easily just call the data frame's to CSV, and there are plenty of libraries for things like populating Oracle databases with data frames as well. So that's pretty much what DataHub does. It's a pretty pluggable model, and we're constantly trying to add more models and refine them. Now, on the differential privacy thing: at the moment,
none of these models is quite there yet. We're getting kind of close, but there's a lot of legwork we still need to do to be able to actually make the claim that it's done, so that's something we're really aiming to get done. Also, DataHub is incredibly easy to extend; it's just Python. So if we wanted our own generation function that was unique to us and wasn't part of DataHub, you just write these couple of lines of code here. You need two functions; one's a partial function. This is an extension we're doing which is just going to say hello to a name, or something. So again, you can see we put the region in, put the country in, then we do the name (we're going to generate a person's name), and then we make this extra message field and call this little extension function we just wrote here. So what should happen here is that the data set will be saying hello Paul, hello Bill, hello Jennifer, and so on; whatever that name is, we'll see it in that message. So it's quite easy to extend the functions there. I guess, what's next with DataHub and DataHelix; and Andrew, chip in here if you want to. So, we're going to start merging these two together. We're particularly interested in the spec language that DataHelix has: how can we use that in DataHub as well? It's probably much easier to just write a quick JSON file than it is to get down into Python, so if anyone's averse to writing Python, we can still use these data profiles that we've got in DataHelix. Also, we want to have a big investigation into the Legend language that I think was presented much earlier on: can we take Legend data type specifications, can DataHub understand them, and can we then start synthetically generating data to that specification and to the constraint language that's in Legend? That's a big area of research we're looking at right now. And we've got DataHub version two coming quite soon. At the moment, when we come to analysis, we need discrete data as well as continuous values; we can't handle data sets of continuous values only, so that's one of the enhancements coming along. We'll add multi-table support as well: say you've got the classic customer-supplier database, you can export those tables as CSV files or whatever, supply them as a multi-table set, and it will understand the relationships between the different tables, the foreign keys and primary keys. So multi-table support is coming, and we'll finish the CTGAN integration as well. CTGAN is an open-source GAN model that generates synthetic data, and we're going to bring that into DataHub too, wrap it over and give it that same consistent way of working as all the other models. Hopefully that's going to be in the next month; we'll be making that big PR and it'll be available. Then we go to the version three features. A big question people always ask is: how big a data set can you generate? At the moment we're reliant on pandas data frames, so the answer is always, well, how much memory have you got? That's how big the data set you generate can be. There are techniques we can use to break it up into smaller parts, but really that's your limit.
So we're looking at Spark integration, so we can run these as Spark jobs on a big cluster, for example, and generate truly huge data sets over in Spark; that integration is what we're looking at. We've got this other element, data type predictors. As I showed earlier, when you do this analysis you've got to pass in the discrete and continuous values; there's another bit of work we've got going on where you give it the input set and it will go and predict what all the types are, which are discrete, which are continuous, which are PII, and then it'll automatically generate those parameters for you and present them to you so you can just double-check them first. And we're also looking for that data type predictor to understand things like: is it a CUSIP, is it an ISIN, is it a standard industry code? So not just "that's a string", but "that's a country code" and "that's an ISIN". And yes, that's the next bit: there are a few more financial types we need to build into this. We already intrinsically understand things like industry codes and so on; do we come to start understanding things like CUSIPs, ISINs, curves, tenors and all those other general financial data types as well? And again, it'll be really interesting to start seeing how we can integrate that with the other projects. And lastly (this is probably going well into next year) we're going to finally bite the bullet on agent-based modeling, so we can see if you can start using DataHub for more simulation-based things around agents. Andrew, have you got anything you want to add? I'll throw the mic over to you. No, I think you did a brilliant job there, Paul. Yes, sorry, lots of talking; I'll have to drink a lot of water now. So, Andrew and Paul, I hope you don't mind me cutting in now, but I think it might be quite nice to throw the Q&A open to our attendees in the audience. If anybody's got any questions for Paul and Andrew, feel free to use your Q&A button, which I believe is right in front of you, to ask them. But in the meantime: each of these questions is going to be to both of you, because I know DataHub and DataHelix are similar, but they are also different. So, Andrew first: who is DataHelix actually geared towards? Who are your customers, who should be using DataHelix? It's an excellent question. We developed it first when we were doing a project for a company that ingests large, basically flat files of trades and then sends those trades on to the regulator; they basically enrich those trades and do loads of validation on them, so really it's regulatory reporting technology. And what we needed was to generate lots of fake test data to use in their QA environments, because a lot of their QA and dev environments were in the cloud, but those environments weren't secured in the same way their production environments were, and they also didn't want to use any of their production data in their test environments. So we really designed it for testers to be able to get some synthetic data up and running super quickly. Also, for that particular project, they had their own spec that their engine used to validate the trades.
So we built a UI that created those rules, and then we thought, well, once we've got the rules, why don't we create a synthetic data generator that uses those rules and generates data that fits them. So it's really aimed at testers or developers who don't have much time and want to generate large volumes of test data super quickly, but don't want to end up writing a whole program to do so. That's amazing, thank you. And Paul, I don't know if you're able to answer the same question: who are your customers? Because I know this is a Citi project that you developed, so it would be good to know who you actually developed it for. Yeah, sure. So this started off life at Citi in something called E4, which is our distinguished engineering program; that's where I kind of authored it. Originally we were aiming at the area I work in, which is client onboarding: we had some very complicated client onboarding events, and it was a struggle to work with vendors in this space, particularly when it came down to APIs and how we work around a particular data set we want. So that was probably our original use case. Then slowly, internally, it grew to other big areas, such as creating portfolios of trades for various asset classes, and I think there were some other elements where we were looking at payments and credit cards for some of our clients as well. We've been working with quite a few clients where the clients themselves wanted to generate synthetic data, so we started to engage with them and help them out. And after that, I guess, we're looking at more and more use cases as we go. We have a working group that James has been running for us, so if there are use cases out there, I'd love to hear about them, and we can always look at the roadmap. That's amazing. And Paul, while I ask Andrew the next question, it would be awesome if you could share your screen and actually show people the DataHub and DataHelix repositories on GitHub, so we can show people how to get involved. But whilst you're doing that, Andrew, I'd like to ask you: are there any potential synergies across other FINOS projects for DataHub and DataHelix? Yeah, I think Paul and I are very interested in chatting to, I guess, the Morphir developers and also the Legend developers, because from our point of view, if there are areas of FINOS working in the modelling space, that spills over into: once you have a model of the data, can you generate synthetic data from the model? It turns out you probably need a little more information than just the model, because the model usually describes the type of the data (so it's an integer, and it might have a meaning, it might say this is actually a RIC code), but it doesn't give you the rules between fields, often complex rules, that you need if you're going to generate data that looks like the data you want to generate. So there are definite synergies, and we're very keen to talk to them. And then we're keen to work out what, above that modelling, you need, such as a kind of shape of the data overlaid on the model, to give you enough information to generate realistic synthetic data. It's definitely something we're very keen to have those discussions about, and I think that's a good first port of call to understand how these could sync together. That's amazing.
Thank you very much, Andrew. And Paul, I can see in the background you've been giving us a bit of a virtual tour of the DataHub repository on GitHub. Would you mind giving us a bit of a voiceover, and maybe taking us through the types of issues that are in the backlog and how people can get involved with the team? So, I guess, reach out to any of us: Andrew Carr, myself, James, and there's also Ben Fielding as well from Jensen. We're all part of this working group under FINOS. I'd say the best way to get involved is just to get on and raise an issue and start a discussion; we're really happy to get a discussion going there. What we're particularly looking for is anybody from AI/ML or data science type backgrounds; even if you're not contributing code, we'd love to hear your ideas, just to help us with creative thinking. And if people want to get in and actually start contributing code, fantastic, that'd be great as well. So yeah, I'd say the best way to get involved is to raise an issue and start a discussion. That's amazing. And what type of contributors are you actually looking for within DataHub and also DataHelix? Maybe I'll ask that question to Andrew first: what type of person would you like to get involved in the project? For me, I guess: architects, in terms of working out the synergy with the other components; developers, because we want people to use it and find it useful; but also testers, so they can keep telling us how easy they find it. And certainly from our point of view, again, there's seeing how these synergies work, but also playing around with the DataHelix playground: you can go online, change the profile, press generate and actually see what the rules can generate. So feedback on those rules as well, and that can come from anyone. So, Paul, I know you're controlling the screen at the moment: if you can remove "issues" from the URL so we can see the entry point for DataHub, and then, in a new tab, go to DataHelix, because I just want people to know that DataHub and DataHelix are two separate projects. So if you'd like to find both projects on the FINOS org: the first is DataHub, which is on Paul's first tab, and the second is DataHelix, which is on Paul's second. And so, Tasha, I believe you also have a question? I do, for both Andrew and Paul. Feel free to ask them. I do; thank you, James. So, in a previous part of my career I worked in electronic foreign exchange, and obviously there are lots of big data sets there, and I'm wondering how easy it would be, if I was still there, to generate a data set of realistic tick data, essentially, for particular currency pairs, to run it through a back-tester and evaluate whether our algorithms were working the way they were supposed to. Is that a use case that could be met through either of the tools? So I'll just talk about DataHelix first of all, and then hand over to Paul for DataHub. Not with DataHelix easily, if I'm honest. I think the rules are such that one of the products which looks at the shape of the data and then tries to mimic it would give you much more realistic data. Unfortunately, because DataHelix uses rules, it doesn't have rules between the rows, so if you did, for example, something that was ticking, it would jump all over the place.
It would be 35 one second and then 102 the next second so it wouldn't be realistic enough. Now, certainly, Paul and I have discussed kind of, you know, how you get something to do the correct shape of the data. And I think the tool you would need to do that would need to mimic the correct shape of the data, such that the data wouldn't just jump all over the place. It would slowly go up or, you know, it would go up if it was going up quickly would still go up in steps it wouldn't just jump up, jump down, jump up, jump down, jump down. So for the moment, Data Helix would be a no, but you know, looking forward, I think with Data Helix and Data Hub looking to come together, I think you could end up with enough components, including Data Helix to be able to do that. But yeah, I think Paul has to talk about the data, Data Hub. Yeah, so for us, when you, I think I showed earlier how you can write your own extensions. So I don't do it quite now. But it's actually quite easy to write your own extension functions. So really, I guess if you're looking at time to his data is the last couple of values you've generated are an input to the next value you're going to generate. If you do it very same, we basically pick or have a number pick a random number between one or 2% from that number, maybe from the previous number, if that makes sense, maybe you would that maybe you want to generate something like that. So you can do we can do that with your own extension function. I don't handle the analysis yet. There is a library. There is some Python work that's been done by some people out of Cornell University that does do analysis of time series data, and then again through GANs again. So I am actually kind of looking at how that could be integrated in Data Hub as well. So, and if they're things mature enough yet to use it. So the answer is kind of a kind of you can. Fair enough. Fair enough. Thank you. I have a another one back from my FX days. There were, we, you know, we had electronic trading happening in multiple different areas across the firm, and some of that information you couldn't. So actually on trade data, not real time, you know, ticking data but trades that were executed. I don't know about them, but because of data privacy laws, you can't pass that between that information between jurisdictions. So, is that a use case where I could essentially, as you were talking about earlier and, and this is not an area of expertise but the privacy plays into it. Is that aware of a use case where I could pass the data over from one region to another, knowing that I have safely not passed any client information or specific regional information that wasn't allowed to be shared. Got it. So I think that's more a problem that would be lend towards redaction and anonymization, because what you what you don't want to do. And with that you want to, you want to straight remove it. Yeah. Otherwise we'll be making up the trades and it'll be a bit scary. While theoretically the shape of all the trip you did like generated 10,000 trades hopefully they'll be assisted the same but can quite guarantee that. And then Andrew I saw Stephen go bump from more fur actually came on cam to ask you a question say Stephen are you there. Yeah, yeah. So one of the things that we've been interested in more fur is the idea of in domain modeling that you want to sit down with your business owners, and kind of come up with a model in real time together. 
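As a rough illustration of the extension-function idea Paul describes (this is not Data Hub's actual extension API, just a sketch of the underlying random-walk technique), each new tick is derived from the previous one by moving it at most one or two per cent:

```python
import random

def tick_series(start_price, n_ticks, max_move=0.02):
    """Random-walk ticks: each value moves at most +/- max_move (2%) from the previous one."""
    price = start_price
    ticks = []
    for _ in range(n_ticks):
        # scale the previous value by a small random factor, up or down
        price *= 1 + random.uniform(-max_move, max_move)
        ticks.append(round(price, 5))
    return ticks

print(tick_series(1.1750, 10))   # e.g. ten synthetic EUR/USD-style ticks
```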
Fair enough, fair enough, thank you. I have another one, again from my FX days. We had electronic trading happening in multiple different areas across the firm, and some of that information you couldn't share. This is about trade data, not real-time ticking data but trades that were actually executed, because under data privacy laws you can't pass that information between jurisdictions. So is that a use case where I could essentially, as you were talking about earlier, and this is not an area of expertise for me but privacy clearly plays into it, pass the data over from one region to another, knowing that I have safely not passed any client information or region-specific information that wasn't allowed to be shared? Got it. So I think that's more a problem that lends itself towards redaction and anonymisation, because what you don't want to do is keep that information; you want to strip it out entirely. Otherwise we'd be making up the trades, and that would be a bit scary. While in theory the shape of the data should be preserved if you generated, say, 10,000 trades, hopefully the statistics would look the same, but we can't quite guarantee that. And then Andrew, I saw Stephen Goldbaum from Morphir actually came on camera to ask you a question, so Stephen, are you there? Yeah, yeah. So one of the things that we've been interested in with Morphir is the idea, in domain modelling, that you want to sit down with your business owners and come up with a model in real time together. And that model would be generating sample test data, so as you're coding, imagine something like a decision table with a finite set of values. It would be great if, as you're coding and adding that finite set of values in an enum or something like that, you could generate the sample data on the side, run it, and your users could see in real time that the rules we're coding right now actually pass those tests. Is that something you would see as doable? I mean, you can definitely do that with Data Helix now with the online playground. How complex the rules are determines how quickly it responds, but you can do that. I won't do a live demo, just because I'm scared of live demos and things always go wrong when you try them, but you can edit the rules in the web browser, press run, and it will regenerate the data. And I totally agree with you, Stephen: ultimately, when users see the data in reality, that's when they go, oh no, that's not right, that can't be that because of that, and that's when you get the really useful information they just happened to have forgotten to tell you. If you go away and code it and come back, they say, no, no, not like that; if it's interactive you get that response straight away. Yeah, cool, thanks. That's great, and before I ask the next question I just want to remind people in the audience that you can ask your own questions using the Q&A facility in front of you on screen. So, Andrew, I understand Data Helix is actually written in Java, and Paul, we know that Data Hub is written in Python. As you've been exploring the synergies between both projects, can you talk about the ideas you've been having to bring the two projects closer together in terms of functionality? I guess for me, Paul and I have been talking about three things in terms of synergy: one is architecture, two is API points and three is language. Data Helix was originally written in Java for a couple of reasons: one, because of the project it was involved with to start with; two, because the resources I had available were Java developers; and three, at the time quite a few of the organisations we thought would use it were known to be heavy Java users. Looking back now, though, I probably would have done it in Python. So I think the move forward is to sync up with Data Hub and move any Data Helix components we want to keep into Python, so that Data Helix and Data Hub become components in the FINOS synthetic data family, I guess. And Paul, is that something you agree with? OK, I think he agrees with it, because we're moving all the components over. Yeah, sorry, I muted myself, and actually I went to unmute myself to talk and didn't realise I was already on mute. So yeah, it's obviously something I completely agree with. And I think one of the benefits of Python in this space is just the weight of the data science community behind it; it's by far the winning language. And within most financial firms now, most of us are using Python to glue quant libraries and all sorts of things together, so in this space it's just a clear winner.
Mainly because, particularly in the synthetic data space, every time I think I'm going to code something up myself, a quick Google search finds that somebody's already built it, and I can just install it, make a few modifications, and expose it through Data Hub. That's brilliant. And I'd like to remind everybody watching that Data Hub and Data Helix are open source projects, so they're here for people to use now. Paul, not to put you too much on the spot, can you just walk through the README, since you know how people can actually fork and clone and bring the projects down locally to test before they get involved with the community? OK, so the simplest way to start with Data Hub is this; let's just do it quickly. I'm going to create a Python virtual environment. You can see Chrome at the moment; interesting that you can see Chrome. Apologies, let me stop that share and start it again, share the screen, that's it, there we go, excellent. So, activate the environment, and getting some data is as simple as installing Data Hub. I can't type today. You can see it bringing the package down, and off you go; that's how you can start with it. Then all you need to do is go into one of our test folders, pick a test, cut and paste it into a Python file, and just run it and it will work. That's if you want to run your own. Otherwise, the easiest way to start is, let me open another window, where am I, Data Hub: if you were to clone this, again start with your virtual environment and just run pytest, and that will run all the tests and show you that it's all working, and you can explore all the test examples that are there. That one's actually my dev machine, so it's broken; of course it's not going to work, and it's lucky I didn't start from there, but it's a good example of what broken tests look like. That's fine. And so both the Data Hub and Data Helix communities are very welcoming: people can join the repositories, bring them down locally, install the virtual environments and run the tests to get up and running, and then if they find issues, feed that back into the GitHub issues for the project and get involved with the community. Exactly. That's brilliant. And so as we lead into the final two minutes of our session this afternoon, Paul and Andrew, do you have any final asks for people who are watching? Is there anything you would like people to get involved with where you need help, just so we have an entry point for anybody who wants to get involved? Andrew, do you want to take that one? If not, I can. I shall take it, then. So I think there are some good areas we're really interested in. There's certainly that link-up with Legend; Morphir also sounds really interesting, and I know we've got a couple of people on from there, so again, what synergies are there in that area. I'd really like to know about any missing functionality, real primitives that might be missing for somebody, if there's anything there. Another thing that's going on is creating a set of examples based on the ISDA CDM specs.
So that was something we engaged with early in the year. Could there just be a whole load of examples along the lines of this is how you make this type of contract, or that type of contract, with a little bit of Python script that you can just cut and paste, and off you go, you can start actually generating that object? So again, that's a really nice, easy area to get involved in as well, for anybody with ISDA understanding. Fantastic. I'm trying to think of some of the other areas; Andrew, have you got any? No, I think you've covered it all, actually. Yeah, just feedback: if anyone's got any use cases, like Paul says, the more use cases we work through, the more experience we get of what works well with the projects and how the projects need to change and evolve to hit new use cases. Yeah. I was just going to say that it's been an amazing past hour being escorted through Data Hub and Data Helix. Thank you very much Andrew Carr and Paul Groves for taking us through both of your projects, and with that I'd like to hand the group back over to Tosha Ellison, who's going to introduce the next section for us. Thank you. Thank you very much for that. So in our final session, I'm pleased to have a presentation on Perspective, which was one of the earlier contributions into the foundation, coming from J.P. Morgan, and so we've really had an opportunity to see it develop and see the community grow, which is always nice. It's certainly an area I'm interested in just from my personal background, and one with broad applicability across industries: how you present complex, fast-moving data in ways that are easy to interpret, and how you give your users access so they can make sense of that data too. Of course that ties into some of the conversations in the previous session. So with that, I'm pleased to show this presentation from Jun Tan on the Perspective team. I will stop my video. Hello everyone, thank you for virtually attending. My name is Jun Tan, and I go by June. I'm a software engineer at J.P. Morgan, and for the past two and a half years I've been part of the team working on Perspective, an open source data visualization engine built for large streaming data sets. This being a FINOS conference, I figured there are people in the audience who've already encountered Perspective: you might have heard of it, you might have used it a little, you might even have contributed to it. So for this presentation I want to talk about something that's pretty new for the project; we've been working on it for a little over a year now, and it's taking Perspective out of the WebAssembly context and porting it to Python, and more specifically using Perspective Python within JupyterLab to do streaming data visualization in a whole new domain, in a whole new way. So I wanted to start by talking about the two predominant ways you can do streaming, interactive data visualizations: a web dashboard, or JupyterLab using Python. Web dashboards are great: they're interactive, they're easy to use, you can have really visually beautiful visualizations and arbitrarily complex UIs, but at the same time not everybody wants to write JavaScript, and not everybody can.
Python is really the de facto language, the de facto standard, for data science, and it's difficult to ask somebody to learn Python for the data part and JavaScript for the UI part at the same time. Look at JupyterLab, for example: it's extremely powerful, Python is the de facto standard and JupyterLab is the standard environment, but at the same time it has the drawbacks in exactly the areas where web graphics and web dashboards excel, in that it's more difficult to write UIs in ipywidgets, and as a data scientist or a developer you might not want to spend all of your time building custom UIs for each individual notebook, or even hand-rolling your own UI library to do these sorts of visualizations and transformations. So Perspective in JupyterLab really takes the strengths of those platforms and provides something that's not only easy to use, interactive and intuitive, with a lot of power for analysis and visualization, but lets you do it on top of the compute power, the threading, the parallelism of Python, as well as Python's extensibility. As long as you give Perspective some data it can visualize it, pivot it, transform it; you can use Python to create arbitrarily complex data sets and data sources, and as long as you give Perspective some data, it'll work, and you get the UI part, the dashboard part, almost for free in JupyterLab. So very quickly, for people who aren't familiar: Perspective is an interactive visualization engine for large streaming data sets. It was built at J.P. Morgan for the trading business, it's still used at J.P. Morgan and by developers and institutions around the world, it's open source, it was open sourced at the end of 2017, and it now exists as a FINOS project. It allows you to transform data in the browser, in the engine, so you can apply pivots, filters, aggregates, sorts and computations using a really high-performance C++ core engine that's compiled to WebAssembly. It also allows you to visualize data: we have a custom data grid and we have a D3FC-based data visualization package. D3FC is built by Scott Logic, who I know have someone speaking at this conference, and I'm sure there are Scott Logic attendees; it's a really great visualization product that we're extremely happy to use, and we're very happy to have people from Scott Logic contribute to Perspective. It runs in the browser, but now we've ported it to Python as a standalone runtime and, specifically, into JupyterLab. So to look at how to do streaming visualization in JupyterLab with Perspective, we're going to take a very common scenario: a fictional portfolio of stocks, a basket, and we're going to visualize and analyze it. We want to do two things. We want real-time analysis: streaming data comes in and we do pivots, aggregations and so on in real time. But we also want larger data sets of historical data, and we want to look at those in the same context, within JupyterLab. We don't want to have to write different data sources, use a different fetch API as we would in a web dashboard, or connect to different data sources
using a web framework. We just want to do it all in JupyterLab, and most importantly, we want to do as little transformation in code as possible. There is data cleaning, there is getting the data into the right format for Perspective, but we don't want to do more calculations on the data; we don't want custom calculations on a data frame, for example, or on a JSON object. We actually just want to take the data, pass it to Perspective, and have Perspective do all the heavy lifting, and you'll see that it's actually quite easy and very intuitive to set up that sort of framework. So I'm going to start with a very quick rundown of the Perspective basics, the API. If you've used the browser version, the UI is exactly the same, the API in Python is pretty much the same as the browser version, and we've designed the two to be completely cross-compatible. So I'll start by importing and creating a table. This being Python and Jupyter, you can give pandas DataFrames to Perspective and we handle that directly. You can either create a table, the base container for data, from a data set, so from this DataFrame here, or you can create it from a schema, which is just a mapping of column names to column types. With the schema we're going to create a table that has an index, and this is important later on, because with our various data sources we're going to be creating a unidirectional data flow, where the data comes from one source, the IEX Cloud API that we're going to use to get our streaming and static data, and flows through a series of Perspective tables, almost like a graph if you will, to store data, to manipulate data, to allow you to do all sorts of different analysis, all based on this one data flow. So to get started, we create this table and we create a view, which is a query on the table. In a SQL analogy, a table is like a SQL table and a view is a continuous query: you can update the underlying table as many times as you want, but you never have to recreate the view; the query stands on its own and it gets notified of new data coming from the table. You can always re-serialize the view out into whatever format you'd like and it will always be up to date. So we'll create an unindexed table and an indexed table, which has a primary key, and we're going to update them. If we update the unindexed table, create a view and query it, we'll see that it's sorted here from 2500 all the way down. If we update it again but don't recreate the view, just re-query it, you'll see that we actually get the most up-to-date data: we've got 2500 and 1500, created a couple of seconds later. Updates append on an unindexed table, but an indexed table will actually update based on the primary key, so if I do an update here you'll see that it overwrote some of the rows, appended some rows, and left some untouched. And this is really important, because we're going to use primary keys and indexes to store some of our data in certain ways, where we only want to keep one row for a given primary key, while in another table we want to keep every single row that comes in regardless of primary key. For example, in this scenario we want one table that has the latest price for every single stock, and one table with every single price change for every single stock, and that's where the index really comes in.
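For readers following along at home, here is a minimal sketch of those basics in perspective-python; keyword names have shifted slightly between releases, so treat it as indicative rather than authoritative.

```python
from datetime import datetime

import pandas as pd
import perspective

# A table created directly from a pandas DataFrame...
df = pd.DataFrame({"symbol": ["AAPL", "MSFT"], "price": [150.0, 250.0]})
table = perspective.Table(df)

# ...or from a schema (column name -> type), here with an index (a primary key).
schema = {"symbol": str, "price": float, "time": datetime}
indexed = perspective.Table(schema, index="symbol")

# A view is a continuous query: create it once, update the table as often as you like.
view = table.view()
table.update([{"symbol": "AAPL", "price": 151.0}])
print(view.to_df())   # always reflects the latest data, no need to recreate the view
```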
And we're going to do this sort of linking, like I've talked about, using on_update, which is just a callback that runs every time the table updates. This is where the joining of data sources comes in: we have a table with an on_update callback, we set it on the view, and when the callback runs it takes the rows that were updated and passes them to a different table, in this case the indexed table. So now if I update the unindexed table and query the indexed table, you'll see that the indexed table was actually updated with all this new data, but it also looked at the primary keys in order to apply it properly. So if I look at table.view().to_df(), you'll see that I've got all this data, and because the table is unindexed it appended all of it at the end, while the indexed table updates based on the primary key. This is really important; we'll come back to it in a second. You can serialize your data from a view, so you can go to a DataFrame, to CSV, to JSON, and that's all great, but the most important thing is that you can serialize your data to Arrow. Perspective integrates with Apache Arrow extremely well: we load Apache Arrows with an almost zero-cost load, copying the underlying memory straight into Perspective, so loading an Arrow is 50 to 100 times faster than loading a regular DataFrame. We also let you write data out to Arrow and pass these Arrows back and forth. It's a really easy way to work with Apache Arrow, which is a super high-performance binary data format; being a binary format normally means you have to use the Arrow library itself to parse it, to write it, to do different things, whereas Perspective offers a visualization and transformation layer on top of Apache Arrow that's extremely intuitive. So we save the Arrow, we can save it to the file system; you'll see I've got example.arrow, and if I load it from the file system, you'll see this is all the data we just had, and we can sort it, group it, filter it, do as much analysis as we like on it. And this is the meat of what we're going to be talking about: using PerspectiveWidget, the Perspective UI, within JupyterLab. So here we're going to create a PerspectiveWidget; it's exactly the same UI you're used to in the browser, with all of the same features, completely cross-compatible. It allows you to not only interact with it at the UI level, so I can take away a pivot, add a pivot, add a filter, maybe I want it to be 2.5, or sort it by b, or get rid of the sort, or add another pivot; it also allows you to interact with it from the Jupyter kernel. So if I type widget.row_pivots you'll see the current pivots, and if I set widget.row_pivots to none it unpivots everything. If I type widget.reset() it'll reset the entire widget, except, I guess, not that one for some reason. But you can transform the state from Jupyter, and it also means that you can actually save that state within a library.
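A rough sketch of that kernel-side interaction; the attribute names below follow the 2020-era widget API shown in the demo (several have since been renamed), so check the current documentation before relying on them.

```python
import perspective

widget = perspective.PerspectiveWidget(
    {"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]},
    row_pivots=["c"],
    sort=[["b", "desc"]],
)
widget                      # rendered as an interactive viewer in the notebook

# The same state is readable and writable from the kernel:
print(widget.row_pivots)    # -> ['c']
widget.row_pivots = []      # un-pivot everything
widget.filters = [["b", ">", 2.5]]
widget.reset()              # back to the default, unconfigured view
```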
So you could very easily, for example, instead of just initializing the widget bare, pass those options in and set them as you wish, or change them later, after the fact. So again, it's a whole different way to interact with Perspective: instead of having to save your interactions, the state of your actions, in a way that might be cumbersome in the browser, doing it in Jupyter, in Python, is extremely easy. And we can also do streaming data with Perspective in Jupyter; obviously streaming data is a big point here. The issue with doing it in Python, in Jupyter, is that we can't block the main thread; we can't let the notebook be blocked while we're writing and rendering this new data. So we're going to use threads in Python: we'll create a new thread, run the Perspective updates on that separate thread, and have it update the UI as it goes. So if we create this widget, right now it's completely empty, and if I start this thread you'll see that it's actually running on the separate thread as we go along. I can do any analysis I like now: I can sort it, I can pivot it, I can do all these different things with it. And while this is running I can still use the kernel as much as I like: I can check the size, I can create a new view; let's see, if I set row pivots to, what's a good column name, and call to_json, there's the first hundred rows. It's querying it, and because it's multi-threaded it's dispatching those calls in a way that's not going to block the main browser thread, sorry, the main kernel thread, which is really important. So from that overview of Perspective, we can now look very quickly at the data sources I wrote, which use the IEX Cloud API to give us streaming market data and also static market data. It's all dummy data; it's really easy to get a sandbox URL and a sandbox key, so we're going to use that. But again, the most important thing about these data sources is that they don't block the main thread, so we're going to use multiprocessing, we're going to use threading, and we're just going to hook them up together. All of this code is available online, all of it's open source, so definitely feel free to take a look at it and play around with it if you want. So we're going to get on to the main part of this presentation. We start by doing our imports, importing our schemas and our data sources, and creating a pyEX client; pyEX is the IEX wrapper library in Python, and it's really easy to use, really simple to set up, and gets us our data really quickly, so we're going to use it, it's a great library. We're going to create a portfolio of stocks, just some random tech stocks here, no particular reason, randomly chosen, it's all dummy data, and we're going to start by setting up our data sources. Like I said earlier, we're going to create a few tables and link them up to create this data flow. We start by creating a table for the stocks we hold right now, a holdings table, and we're going to want to index it, because we just want to know the latest values as new prices come in.
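Here is a hedged sketch of that threading pattern with a dummy ticker standing in for the IEX feed; the widget bound to an indexed table and the daemon-thread update loop reflect the pattern described in the talk, but the exact setup may differ from the original notebooks.

```python
import random
import threading
import time
from datetime import datetime

import perspective

schema = {"symbol": str, "price": float, "time": datetime}
holdings = perspective.Table(schema, index="symbol")   # keeps only the latest row per symbol
widget = perspective.PerspectiveWidget(holdings)       # display this in a JupyterLab cell

def feed():
    # Runs on its own thread, so the kernel (and the notebook UI) is never blocked.
    while True:
        holdings.update([{
            "symbol": random.choice(["AAPL", "MSFT", "GOOG"]),
            "price": round(random.uniform(100, 300), 2),
            "time": datetime.now(),
        }])
        time.sleep(0.5)

threading.Thread(target=feed, daemon=True).start()
widget   # keeps re-rendering as updates arrive
```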
If the prices all appended we'd have to do some pivots to get the latest, but we don't really care about that history in this table, so we're just going to store it with an index, which means we always get the most recent price and most recent value for our portfolio. So we update here, and then we create an unindexed table which is going to store all of the price changes for all of the stocks in our portfolio, which, excuse me, is important because we obviously want to know how it changes over time. So we create that, and we hook it up with an on_update like I talked about earlier: when the indexed table updates, it takes the rows and pushes them to the unindexed table, which is basically just a dump for all of the data that has come in so far. I also have a very interesting architecture diagram that I cooked up earlier, and we can look at it here. The architecture is basically: we have the IEX Cloud API, and it's all one-way. The data stream comes into a quotes table, which we're going to create in a second; when the quotes table updates, we update the holdings table; when the holdings table updates, again, that's the indexed table, we update the unindexed table; and from the unindexed table we now have all of the data we've been receiving, and we can dump it to disk using Apache Arrow, which allows us to stream it, to store it, to pass it over a network boundary, to pass it to a different notebook, to send it to somebody using Perspective, to host a server on that data with Perspective; there are really a lot of possibilities. So if we create these tables, we can now create a widget, and most importantly we configure this widget with a computed column called value, which is a column on the widget that calculates the value of the portfolio by multiplying quantity and price. This goes back to the requirement we had earlier, that we want to touch the code as little as possible. We're not doing this calculation in Python, or at least not in user-facing Python: the user doesn't have to write code that transforms the data before it goes into Perspective; they can do it completely within Perspective, in fact from within Perspective using the expressions API, which is something new we've built, and which allows you to write arbitrarily complex expressions that update as your table updates and resolve themselves properly. So if I delete that and do, say, the square root of value, and push it out, it will actually give you all of these columns, and you'll see that when we start updating the table, these columns update at the same time, in real time basically. So now we can create another widget for our total table; this is going to be configured as a line chart that just looks at all the prices as they tick. Again, looking at the data flow, what we've set up so far is these two tables, our holdings, and now we can set up a quotes table, which is the table that's actually going to take data from the IEX API. So again, using an on_update: when the quotes table updates, we send the data to the holdings table, and when the holdings table updates, we send the data to the total table, and so on.
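To make that one-way flow concrete, here is a hedged sketch of the chained on_update linking. The on_update callback signature has varied between perspective-python releases; this assumes the mode="row" form in which the callback receives a port id and an Arrow-encoded batch of the changed rows, so treat it as indicative.

```python
from datetime import datetime

import perspective

quote_schema = {"symbol": str, "price": float, "time": datetime}

quotes = perspective.Table(quote_schema)                    # raw feed from the data source
holdings = perspective.Table(quote_schema, index="symbol")  # latest price per symbol
history = perspective.Table(quote_schema)                   # every price change, appended

# Each stage pushes the rows it receives downstream, so data only flows one way:
# IEX feed -> quotes -> holdings (indexed) -> history (unindexed) -> Arrow on disk.
quotes_view = quotes.view()
quotes_view.on_update(lambda port_id, delta: holdings.update(delta), mode="row")

holdings_view = holdings.view()
holdings_view.on_update(lambda port_id, delta: history.update(delta), mode="row")

quotes.update([{"symbol": "AAPL", "price": 150.0, "time": datetime.now()}])
```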
So we can create two more widgets here. One of them is just going to be a grid, sorry, showing us the latest prices, and the other is going to be a chart showing the latest prices coming in, and we can start the data source. We do a little bit of cleaning first: the sandbox data doesn't have the right datetimes set, and we want it to look live in this case; they're randomly generated datetimes, so here we're actually just going to use the current datetime as a proxy, and we start the data source. So now you'll see this data is coming in live right now, right? It's updating here, it's updating on our quotes table, these two are actually being fed from the same table, and it's updating the value of our portfolio over here. So this is the value of our portfolio over time as it increases or decreases; because it's dummy data it will probably stay pretty still, but you get the point. And here are our columns; I think I need to change this to not none, and now it'll update. Zoom is at zero for some reason; I think that might be a data issue. But you'll see that as our prices update on this indexed table, all of the computed columns are updating at the same time, in lockstep, right? And again, going back to one of our original requirements, the main browser, the main kernel thread, my bad, is never blocked. I can run as many queries on this as I'd like: I can check the quotes table size, for example, ah, quotes, excuse me, and I can see it continue to tick. I can do the same thing again and get another view out of it; let's do to_dict with a start row of 100 and an end row of 150, and that's 50 rows out as a dictionary, and I'll clear that output so we don't have to look at it. We can also take this output, put it in a separate view, have it over here, so now we can look at our data as it ticks live; that might again be dummy fluctuation. And we can also create more computed columns. For example, if we didn't care about the data ticking in every second or sub-second, we can look at it per minute instead: if we bucket by time and change the pivots here, we can see that everything is now being aggregated as total value over time. And I think this might be wrong; there we go. Because it was bucketing by minute, it was adding together the value at every second, so we change the aggregate to last, which is basically the last value in the table for that minute, and that gives us the correct value for our portfolio. So you can see, and we can also get rid of the split; now this is just, I think this might be a sum, and we might just have to re-pivot it. But you can see the point: it allows you to do as much analysis as you'd like in JupyterLab, you have the UI given to you for free, you have the ability to do complex visualizations, and you also have the ability to do static visualizations at the same time; it's just as powerful.
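A hedged sketch of that chart configuration, using the 2020-era keyword names (plugin, row_pivots, aggregates) as spoken in the demo; newer releases rename several of these (group_by and so on), and the derived value column shown in the talk was a computed/expression column whose syntax also differs by version, so it is kept here as an ordinary column for safety.

```python
from datetime import datetime

import perspective

# "value" is stored as a plain column in this sketch; in the demo it was a computed
# column (quantity * price) defined on the widget itself.
history = perspective.Table({"symbol": str, "value": float, "time": datetime})

total_chart = perspective.PerspectiveWidget(
    history,
    plugin="y_line",                 # line chart of portfolio value over time
    row_pivots=["time"],             # one point per time bucket
    columns=["value"],
    aggregates={"value": "last"},    # last value in each bucket, not the sum of every tick
)
total_chart
```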
We can also back test using historical prices. So with a little bit more cleaning here, we get the data for the past five years, we create another widget, and we render everything. Now we can see, excuse me, the open, high, low, close for the S&P 500 index ETF, and if I didn't want to look at it every single day, I could look at it by month, bucket it, actually, yes, a month bucket on the date. And if I bucket it here, you can see the actual open, high, low, close for each month. I can take that and pivot it out by symbol, so now I can see it for all symbols, and I can also filter it down to something more specific: we can look at SPY, we can also look at Snap in this case, again, completely different charts. And then if we want to look at it at a more granular level, we can just take off that pivot, add this pivot back in, and now we've got an extremely granular, per-day view of the open, high, low, close prices, right? But again, all of this is extremely easy to do using Perspective, using the UI that's already provided. It allows you to hook up a complex data set from the Python kernel and do it with very minimal transformation in code by the end user. So imagine an end user who's looking at a notebook or a Voila app, or you're preparing a report using a notebook: you can give them the notebook, and instead of looking at pre-generated charts or having to dig into the notebook to regenerate your charts, you just show them Perspective, and they can do all of the transformations, do everything interactively, within the UI. One of the other really cool features I wanted to talk about is the integration with Apache Arrow. If we create a view here that just has the value, we don't care about the price, but we're calculating the value, and we save it on another thread, we've just saved everything to the file system and we can open it up. And again, there's quite a lot of power in this, right? We can do exactly the same analysis we did earlier, well, maybe not quite like that, but we can do it here and split it by time, and now we suddenly have all of the prices we've been looking at, except in a static form, obviously; it's not updating live. But you can see the power in this: if you wanted to, for example, build a notebook that ran live analytics over a day, saved it to disk at the end of the day, ran that for 20 days, and then wanted to look at every single day of analysis, you don't have to go back to the web API or whatever data source you're fetching from and try to reconcile everything, or maintain a separate database. You just store 20 Arrows, load them in a row, and put them in the same PerspectiveWidget, and there you go.
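A short sketch of that Arrow round trip; view.to_arrow() and constructing a Table from Arrow bytes are documented perspective-python calls, while the file handling around them is just illustrative.

```python
from datetime import datetime

import perspective

history = perspective.Table({"symbol": str, "value": float, "time": datetime})
history.update([{"symbol": "AAPL", "value": 15000.0, "time": datetime.now()}])

# Serialize a view to Apache Arrow and dump today's data to disk.
arrow_bytes = history.view().to_arrow()
with open("example.arrow", "wb") as f:
    f.write(arrow_bytes)

# Later, or in another notebook: load the Arrow straight back into a new table
# and explore it with exactly the same pivots, filters and widgets.
with open("example.arrow", "rb") as f:
    snapshot = perspective.Table(f.read())

perspective.PerspectiveWidget(snapshot)   # same UI, now over a static snapshot
```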
You can also host a web server; you can run a remote Perspective using Perspective Python together with Perspective in the browser; there are a lot of possibilities. So one of the things you can do really simply is, we're going to restart it just to make sure, open it in a separate notebook: now we have all the same data again, you can do all the same pivots, everything in the same way. Or you can actually open a Perspective workspace that's running against a Perspective service, getting its data from a Perspective server, and in fact this whole time I've had a server running in the background that uses Perspective Python. I think the render thread might be blocked, but there we go: if I update all the points, you'll see that it's been running for about 40 minutes now, and as far as I can tell it's still getting live data, and if we bucket this by minute, you'll see that we have each minute here. And so this really brings together the paradigms of a web dashboard and a Python back end, and it's more powerful than what you can do in JupyterLab alone, although there is a little more setup with a web server, and you have to deploy it or run it somewhere. But it allows you to create even more interactive dashboards that are fed by Python, with all the state synchronized. It's extremely powerful: you can, if I use last, there we go, you can have all of the power of a Perspective workspace. You can have this, for example, be a global filter and click on it to filter the rest of the dashboard; you can package a dashboard; you can deploy it at an endpoint; you can send it to somebody; you can dump the data out into a CSV; there are a lot of possibilities there. And I think that really brings my presentation to a close. I wanted to thank you all for attending, and to thank the organizers, both at the Linux Foundation and at FINOS, for putting this event together and for inviting me. Everything I've talked about is open source: every single part of the code you saw is open source, Perspective is fully open source, and the notebooks, everything in this presentation, is online. So feel free to take a look at it, and feel free to reach out to me or on the Perspective repo with any questions or any ideas. And thank you for attending. Thank you so much, June, that was fantastic. You've already answered one question online; in case anybody missed that, do you want to just touch on that question and answer? Hi, I don't know if you can see me. Okay, awesome, perfect, thank you. So the question, let me read it out again, from Adam Jones, was about whether we can create recordings that can be rerun or viewed at a later date. I'm not exactly sure what recording means in this context. In JupyterLab the notebook state is always persistent, so as long as the kernel is alive, it'll be there. With the Perspective viewer, with the widget, the way we've implemented it the viewer state will actually persist even after the kernel is dead. The Python kernel could be crashed or stopped or whatever, but because Perspective is a WASM-based engine, we're actually able to run the exact same data set in WASM at the same time as the Python kernel, so you're able to see your state and do any further transformations you'd like within the UI long after the kernel is dead. You can also serialize your data out to different formats and save the config of your widget. Part of that is enabled by the platform of JupyterLab and this notebook architecture, and part of it is the way we've implemented PerspectiveWidget. I don't know if that fully answers the question, but that was my take on what recording meant in this case. Thank you. And Adam, we can always put you in touch with June afterwards if you'd like to carry on the conversation, but he says yes, that does, so thank you. I know we're a little bit over time, so are there any other questions before we wrap up? Great. June, and everyone else, thank you so much for joining us, and as everybody has reiterated, this is all open: go check out the repos and figure out how to get involved.
You've heard from our project leaders that everybody is friendly and really welcomes contributions, so please do take the opportunity to join our community. And with that, I will thank you all very much for attending.