All right, it looks like we've got everything connected, so I think we can kick off. First of all, thanks for joining the session. We'll start by giving some quick intros. My name's Andrew Omarad. I sit within our data engineering group. On one side, I look after our business intelligence platforms, and on the other, I look after our data governance tooling within the Alloy ecosystem. I've been with the firm for around 20 years and have spent the last five years in the data space.

Hi, I'm John Pencilow. I've been at Goldman about 11 years. I work in the data engineering space as a developer on the Legend platform, and I've worked on the Legend and Pure platforms for around eight years now.

All right, so before we dive into the main part of the talk today, we wanted to give some background for those less familiar with Legend. Legend is a project that we open-sourced through FINOS back in October 2020. One of the goals was to open-source technology that we had used within our own internal data platforms for a number of years. Internally, those products are known as Alloy and Pure, so if I use those names, they're largely interchangeable with Legend; it just varies which parts have actually been open-sourced. So one aspect was the technology. The other was to provide a place, a platform, for people to collaborate on financial models, with a way to describe those concepts and use them in a form that formally expresses the business concepts, rather than describing them in words or documents that people have to reinterpret. One of the key goals of the early pilot was to use that collaboration to contribute changes that were ultimately proposals for the CDM data models.

From a code point of view, it's on GitHub and is actively contributed to by our developers each day. Part of that is open-sourcing existing Alloy features into the Legend code base; the other part is contributing new features that go straight into the open-source version first and are then consumed back internally.

To talk about Legend as a data platform: really, it's an ecosystem of tools and also a way to think about your data. It takes a relatively opinionated view, wanting people to formalize the concepts in their data sets and the relationships between them. The idea of that is, first, to bring trust and understanding to data consumers so that they know what they're working with, and second, from a data producer's point of view, to make it easier to provide data services that can scale to many different consumers who may have slightly different questions or be looking for different insights. The tools aim to help all the way from discovery through to governance. One of the key components within the platform is the data modeling side of things. The goal is to enable people to express business concepts and map them to the data that provides them, so we're not presenting users with the physical way the data is stored, but expressing it in a way that makes sense in the business domain.
Internally, we use Legend both for creating models within applications and for our core enterprise data models, where that trust, confidence, and traceability in the data sets becomes critically important, because those data sets feed into a number of regulatory reports. I'll hand over to John to talk about the problem space in a moment, but actually, one thing I missed: in the past we've focused a lot on the models themselves and how they live within Studio, but what today's talk focuses on is how we take the information in those models and expose it to other analytics sources and tools.

Yeah, sure, thanks Andrew. So, as Andrew just said, a lot of what you've probably heard about Legend is very much in this modeling space: building out strong data models using strong modeling concepts and building out big graphs of models, which is fantastic. But in the wider world, a lot of the tooling people use works much more in the flat, tabular space. Think of pandas and data frames; think of BI tools, which integrate very well with relational tables; and think of all the other tools where what you're visualizing is fundamentally a grid of rows and columns. So how do we bridge that gap? How do we let people use the data models and the power of the Legend platform, and everything they gain through that, while still using the tools that everyone knows and loves to consume and visualize their data? What we're trying to avoid is people hard-coding database connection details into their BI tools and connecting to raw data. We want them to consume their data via the models that the data producers have actually provided.

So where do we start? We start, as you'd expect, with a data model, the graph world. On the left here we've got a very simple data model, but we need to flatten it somehow to get it into these tools. How do we do that? This is where the Pure language and protocol come into play. The Pure language and protocol are similar in nature to something like GraphQL, but here we're going to focus on what we call the project function. The project function allows us to navigate through that graph and perform complex operations on it, filters, expressions, calculations, et cetera, but then project out a flat table view, a tabular data set (TDS). If we look at the code block on the screen, you'll see the project with a list of eight columns, which will end up being the columns in your table. Looking specifically at the fourth column down, we can see a very simple navigation from the source model, through the product information, to the manufacturer names: a graph navigation ending up as a column. Now that we've got this project function, we can wrap it up and bind it to the connection details and the mapping that the producer has provided, to make an executable unit. This executable unit can also go through a full software development lifecycle: review, release, et cetera. So we've got a nice, well-governed executable unit.
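To make that flattening idea concrete, here's a minimal, self-contained sketch of what project does conceptually: each output column is a navigation path through the object graph that ends in a scalar value. All the class, field, and column names here are hypothetical stand-ins for illustration; the real Pure syntax on the slide differs.

```python
from dataclasses import dataclass

# Hypothetical model classes standing in for the graph on the slide.
@dataclass
class Manufacturer:
    name: str

@dataclass
class ProductInfo:
    manufacturer: Manufacturer

@dataclass
class Vehicle:
    model: str
    product_info: ProductInfo

vehicles = [
    Vehicle("Roadster", ProductInfo(Manufacturer("Acme Motors"))),
    Vehicle("Hatch", ProductInfo(Manufacturer("Bolt Cars"))),
]

# "project": each output column is a path through the graph. The second
# column mirrors the slide's navigation to the manufacturer name.
columns = {
    "Model": lambda v: v.model,
    "Manufacturer Name": lambda v: v.product_info.manufacturer.name,
}

# Flatten the graph into a tabular data set: one dict per row.
rows = [{name: path(v) for name, path in columns.items()} for v in vehicles]
for row in rows:
    print(row)
```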
So now we've got that, how do we actually get from there, through all of our BI tools, to the data? This is where the Legend TDS protocol comes in. TDS stands for tabular data set. The Legend TDS protocol is, firstly, modeled: we have a data model for it in Legend, as you'd probably expect. There are two core concepts within the protocol, one being an input and the other being an operation. An input is, as I've just said, something like a Legend service, though there are many other kinds of inputs you could use. Operations are the standard relational operations you'd expect: sorts, limits, filters, joins, et cetera. The protocol itself is designed to be fairly readable, so when we show you some examples of it, the function names and the parameter names in the payload are very readable. It's also very visualizable: it forms a pipeline of operations from an input source, with all your operations stacked on top of that. We have tools where you can pass in the TDS protocol graph and it will show you a graph of what's happening, the specification of what you've asked for. And it's very extendable: it's easy to plug new input types and new operation types into the protocol. So that's really the basis of it: a protocol for defining pipelines of operations.
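As a toy model of that "input plus stacked operations" idea, here's a short, runnable sketch. The class names, operation names, and service path are illustrative assumptions, not the actual protocol model.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceInput:
    """An input node: here, a reference to a Legend service (hypothetical path)."""
    service_path: str

@dataclass
class Operation:
    """One relational operation stacked onto the pipeline."""
    name: str          # e.g. "filter", "sort", "limit", "restrict", "join"
    parameters: dict

@dataclass
class TdsPipeline:
    source: ServiceInput
    operations: list = field(default_factory=list)

    def push(self, name, **params):
        # Each call stacks one more operation on top of the input source.
        self.operations.append(Operation(name, params))
        return self

pipeline = (
    TdsPipeline(ServiceInput("mycompany::services::CarCrash"))
    .push("filter", column="Vehicle Speed", op=">", value=50)
    .push("limit", rows=100)
)
print(pipeline)
```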
Now we move to the other two parts here, which are client libraries built on top of the TDS protocol, starting with the Python library. The idea with both the Python and the SQL libraries is to get a natural, native feel. When you're in Python, you're writing Python code as you would normally write it; in SQL, again, you're writing or interpreting SQL. The Python library is directly integrated with data frames: you essentially create a frame on an input and then push further operations down on top of it using lambda expressions. The SQL library, very similarly, allows you to write SQL expressions where the from clause is essentially your input source, for example the Legend service that you want to execute.

So why are we doing this? Why can't people just connect to the database tables? A lot of it comes down to separation of concerns between consumers and producers. Consumers want high-quality, trustworthy data, but they shouldn't need to understand the complexities and technicalities of the raw data itself: the database tables, the joins, the authentication, the scaling, where the data actually lives, et cetera. The producers, the people who own that space and know it, want to be able to abstract their users away from all of that behind the data model: a single interface into their data, with the complexities hidden away. The Legend platform also gives you a lot of other benefits, for both the consumer and the producer, things like operational metrics, tracing, usage analytics, and lineage calculations, which you don't get if you're connecting directly to the database from your BI tools or from Python. Out of the box, by consuming your data through the platform, you get all of those benefits.

So how does this really work? The key point of this slide is the TDS protocol being the common interface from all of the BI tools. The TDS protocol gets pushed into the Legend engine, which interprets the pipeline of operations the user has asked for and pushes it all back down to the data source. The user is not consuming all of the data and then doing client-side filtering; every operation, the filters, the sorts, the limits, et cetera, is pushed down through the engine to the underlying database. The connectors here indicate the integration points: in the BI tools, you might create a SQL connector that parses the SQL generated by those tools into the TDS protocol, defining the specification of what you're asking for.

So let's look at a couple of code samples of what this looks like in practice; Andrew's going to do some demos of it as well. The equivalent Pure here is really our end goal: it's what the client is specifying in the UI, and it's what we're trying to push down into the database. Looking at what's on the screen, we've got the projection, the flattening of the model, and that becomes your input; that's the Legend service that's been encapsulated into an executable unit. In this example, we want to push down an additional filter, vehicle speed greater than 50, and also restrict the output to these four columns. That's essentially the Pure expression we're trying to end up with, which we then convert to SQL in the engine. So what does that look like from a Python perspective? You'll see here we've got the Python library, called PyPure here, and the frame's input becomes a reference to that service; in this example, "car crash" is the name we've given to that Legend service. The user then writes a filter and a restrict operation on top of it. When you materialize the frame, that gets interpreted, the TDS protocol is built up, and everything is pushed down. Moving to the SQL version, what you see is the same operation: select these four columns from the reference to a Legend service where the vehicle speed is greater than 50. You can see how all of these are really one and the same. The SQL expression gets parsed by the SQL connector, forms the TDS protocol, and gets pushed down. And if we look at what the TDS protocol looks like on the wire, as JSON, it's a very simple graph and pipeline of operations: at the top you'll see the first node being a reference to the Legend service, and then the two operations pushed down on top of that.
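For a feel of that wire shape, here's a hypothetical rendering of the payload as a Python dict: a service-reference node at the bottom with the two operations stacked on top. The field names are guesses for illustration; the actual JSON structure is the one shown on the slide.

```python
import json

# Hypothetical payload shape (field names are assumptions, not the real spec):
# read inside-out, it is service -> filter -> restrict.
payload = {
    "operation": "restrict",
    "columns": ["State", "Make", "Model", "Vehicle Speed"],
    "source": {
        "operation": "filter",
        "expression": "Vehicle Speed > 50",
        "source": {
            "input": "legendService",
            "service": "mycompany::services::CarCrash",  # illustrative path
        },
    },
}
print(json.dumps(payload, indent=2))
```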
So I'm going to hand it back to Andrew, who's going to go through some demos and show a live version of this. Yeah, so we're going to move to some demos in a moment. But before that, we just wanted to touch on how we've been using this internally, having this tabular or query access onto services. From my own perspective of looking after our business intelligence platforms, this has enabled us to bridge the gap between people who want to build their own self-service analytics and, on the producer side, teams wanting to use the Legend modeling capabilities. Suddenly, there just doesn't need to be a Legend-specific BI tool; you can use your existing tools and connect. To touch on one example (there are a few written up here), we had a team within our Risk division who had already created some data services that they were using in their back-end analytics and wanted to provide some reporting or visualizations to their management group. Rather than breaking the encapsulation of the logic within their data models, they were able to use those same services and connect directly to them with their preferred BI tools.

So, to move to the demos: we've got a few demos here, and they all use a common data service as input, to keep them consistent. The idea is that this is a data service a producer has nominally created, showing vehicle complaints information. The service has one row per complaint and about a million rows in it, with different dimensions for where the complaint was created, what vehicle was involved, et cetera.

The first example we wanted to show is accessing this from Python, in a standard Jupyter notebook environment. This lets us connect data in the Legend ecosystem into the Python platform, where people may prefer to do more advanced machine learning or other types of analytics. You'll see here it's a relatively straightforward interface: you pull in the library dependency and then create a data frame and a query on top of that. People can preview the data to get a feel for it, and they can also build more complex queries; in this case, it's doing some aggregation, some sorts, and limits. The important part to reiterate is that the query is actually pushed to the server, so it pulls back a smaller data set into the Python environment. We're not pulling back the full one million rows; we're just pulling back the aggregated results. And once we've got a pandas data frame, we can use it in the standard way within Python. Here we display it as a table, but we can equally go on to render it as a chart, in this case showing the number of complaints by state. So we begin to get that analytics capability. This next slide just shows the code behind that query in slightly more detail, because I was slightly concerned the text was too small to read.
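Pulling that notebook flow together, here's a hedged sketch of what the client code might look like. The library and method names here (legend_tds, frame_from_service, and so on) are assumptions for illustration, not the actual published API; the point is the shape: build the frame, stack operations, and only the small aggregated result crosses the wire.

```python
import legend_tds  # hypothetical client library name

# A frame over the producer's service; no data is fetched at this point.
frame = legend_tds.frame_from_service("mycompany::services::VehicleComplaints")

# Stack aggregation, sort, and limit; these become TDS protocol operations.
top_states = (
    frame.group_by(["State"], aggregations={"Complaints": "count"})
         .sort("Complaints", ascending=False)
         .limit(10)
)

# Materializing executes everything server-side; only ten rows come back,
# not the million underlying rows.
df = top_states.to_pandas()
df.plot.bar(x="State", y="Complaints")  # standard pandas charting from here
```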
But that's the experience from the client's point of view. How do we show what's actually happening within the Legend platform? Within Legend, observability is a key feature: we have Zipkin traces through it, so we can look at the spans, the activities happening as part of that request. We can see all the details of how long it took to connect to the database and how long the query took, and most importantly, in this case, we can see the request payload that was sent by the Python client. Zooming in a little to call out some of the features: this is the JSON graph, a relatively simple one here, but it shows the reference to the service, the request for the aggregations, and the sort and the limit. If we then jump ahead to a later step in that Zipkin trace, we can actually see the database SQL that was executed. So really, the reason for showing this is to repeat the point that we take the original data service the producer provided, take the user's intention of what aggregates, filters, or sorts they want, and translate that down to the optimal execution, pulling back only the data they need. You see the features here again: the top 10, the order by, the aggregations, et cetera.

To flip to a different example integration, this is a web app. We have a few different integrations that use the TDS protocol to fetch data; in this case, it's a query-building tool. People can use it as part of exploratory data analysis to get a feel for what's in different data sets, or as part of the data lifecycle to understand how they might want to shape a model, what the range of values is, or whether it makes sense to join two particular data sets. In this case, it's doing a very similar query: building an aggregation and previewing the results. But it's a very visual experience, so non-engineers or low-code users can use it to build those aggregations, filters, et cetera, and still pull back relatively small result sets. And again, to show the traceability behind that, you'll see the features here; this one shows the filter capability, and if we go on to look at the SQL for it, we can see the same filter being pushed down. You can see it at the end there, filtering on the state, which was the last activity.

The final integration or demo we wanted to show is a commercial BI tool letting users connect to these data sources. It's got a native connector, so people can easily say, okay, I want to connect to, in our case it's called Alloy because that's our internal name, but conceptually it's Legend as well. They can give the reference to their service, and they get the natural feel they'd expect within that BI tool, with the service presented to them as a virtual table. They can add calculations within the tool, which again get translated into SQL and pushed down for execution. So really, we're not trying to add a layer of heavy computation in the Legend engine; it's all about taking the user's intention and translating it down to the best SQL for it. In this case, we've pulled up an aggregate count of the complaints. You can do the typical things you can do in a BI tool: break out by your different dimensions to get the details, or change the chart or visualization you're using, switching to a tree map here. Essentially, this is using the BI tool as a query-building tool for non-engineers who know nothing about the details of the data service behind it. The SQL built by the tool in this case is relatively simple, but these queries can obviously get a lot more complex as people do other types of visualizations or compose multiple charts together to build a dashboard. At the risk of being repetitious, you can call out the same features here: we're referencing the common service as the entry point, we're restricting the number of columns, and again we've got an aggregation going on, all of which gets pushed down to the database.
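As a last illustration of that "intention down to SQL" translation, here's a toy, runnable sketch: a user intent of the kind captured by the TDS protocol, and a naive renderer that turns it into a single pushed-down statement. The table and column names are hypothetical, and the real engine's SQL generation (mapping-aware, dialect-aware) is far richer than this.

```python
def render_pushdown_sql(table: str, intent: dict) -> str:
    """Toy translator from a TDS-style intent to one SQL statement."""
    agg_col, agg_expr = next(iter(intent["aggregate"].items()))
    keys = ", ".join(intent["group_by"])
    sort_col, sort_dir = intent["sort"]
    return (
        f"SELECT {keys}, {agg_expr} AS {agg_col} "
        f"FROM {table} GROUP BY {keys} "
        f"ORDER BY {sort_col} {sort_dir.upper()} LIMIT {intent['limit']}"
    )

# The aggregation from the demo: complaints by state, top 10.
user_intent = {
    "group_by": ["state"],
    "aggregate": {"complaints": "COUNT(*)"},
    "sort": ("complaints", "desc"),
    "limit": 10,
}

# Only this statement runs in the database; only its small result returns.
print(render_pushdown_sql("vehicle_complaints", user_intent))
```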
So in terms of these features, they're ones we already use internally; our current focus is on contributing them, moving them from the Alloy code base into the Legend code base. The current target for getting all of this released is Q4 of this year, and that will include the TDS protocol specification, the TDS execution, the Python library, and the SQL connector. By doing this, we think it will help people create more integrations that, as we've seen, let them get even more value out of the data models and data services they've created. So with that, firstly, thank you for attending and for the interest. Hopefully we've shown you some features that you'll find useful in the future. If you want to find out more, there's the Legend website on FINOS, or you can watch the GitHub repository and see the code land there.

Yeah, so in terms of the sources that Legend engine can connect to, it can currently connect to relational sources, which are the main use case. It also has support for fetching from APIs or services. Those source connectors are separate modules within the code base, so they're relatively easy to extend to contribute new data sources and data types. And definitely, that's an area where the more sources Legend can connect to, the richer the ecosystem. Yeah, thanks. Anyone else?

Yeah, as I said, I think a lot of it is really about separation of concerns: data producers owning the technicalities of everything and exposing a nice logical business model as the interface into that data. As I said, there are lots of features of the platform that you get for free, the lineage, the usage metrics, the Zipkin integrations, the monitoring, et cetera, which you really benefit from, from both the producer angle and the consumer angle. There's also a real focus on the SDLC of both the models and the services that people produce, and on the tooling, the integrations with different tools and some of the low-code tools we've shown today. And the alternative, having your BI tools reference specific database tables, isn't great from a data producer's perspective for understanding who's using your data.

So the engine is an on-demand query engine, as in request-response: someone issues a query and gets back a data set. It's not a streaming, event-based thing, though obviously your data source behind Legend engine could be caching or listening to streams for its own updates; queries onto the engine are a request-response interaction. It's maybe also worth clarifying that when we talk about models, these are data models rather than, say, financial models of how we calculate risk. Typically, the data being returned is pre-calculated rather than the result of running a scenario or analysis on demand. Ultimately you are dependent on the response time of the database, but Legend does have ways for you to optimize that query routing logic.
And that's actually one of the helpful features from a platform point of view: a user might issue the same query repeatedly, and if you notice it being slow, you can introduce aggregate awareness. If you notice their query is an aggregate, you can materialize it as a different table; from the consumer's point of view the query doesn't change, but as the producer you can influence where and how it executes. Okay, great. All right, well, John and I will be around afterwards, so if people have more questions, please just stop us and we're happy to chat.