Hi, we're excited to be joining you today to talk about a topic that may seem daunting yet is becoming, if not already, a key focus for businesses. I was reading a book recently on the technology innovations likely to reshape businesses in the next decade, and a quote that really stood out to me was: "every company will become a technology company; every company will become a data company." That may seem far-fetched to some of you, while others are thinking "absolutely, I see what you mean," and that will depend on the context of your work and the sector or industry you're in. But if we start from this point, then how we control and manage data is going to be key to future success.

Over the next 30 minutes we will share some thoughts on data governance, dive deeper into data quality and how Legend can help you bring this to life, and finally look at how data governance is being considered in the collaborative opportunities across the financial industry, to build confidence and adoption so that as an industry we are pressing towards the technology innovation evolving around us. Although I'm speaking to you in digital form right now, I am also in the room with you at the conference here in London, and I hope to be in New York later in the year. I'm also joined by my colleague Beaker from New York, so let's start by introducing ourselves.

I'm Fiona Ackland, an executive director in Data Engineering at Goldman Sachs. I co-manage our global data models and governance team, focused on supporting the Global Markets division and acting as a consultant to other areas of the business. Prior to this role I spent six years in operations supporting our credit trading desk, so I felt firsthand the complexities of managing large data sets, and it has definitely set me up for the role I'm in now, both at the firm and in the industry. I also have the pleasure of co-leading the FINOS Financial Objects SIG and interacting with many of you to drive the industry forward. Beaker, do you want to introduce yourself too?

Yes, thanks so much, Fee, and hello everyone, thanks for joining our presentation today. I'm Beaker, a vice president in Data Engineering at Goldman Sachs. I manage the program to open source Legend and also work as a product manager for the Legend stack. This role entails a lot of exciting responsibilities, one of which is being able to talk about Legend with the awesome open source community. Fun fact about me: I'm originally from Germany but have lived in five different countries, and I'm maybe one of the very few Germans who don't like eating potatoes.

So let's jump straight in. Data governance can cover a vast array of concepts, and your context is going to drive which of those are more important to you. It is said that data is the source of knowledge in the digital era we find ourselves in. Even in our daily lives we rely on data and on being able to query it quickly. I'm sure many of you travelled here today on public transport: you may have queried train times, relied on the Underground status boards, or perhaps ordered an Uber. I'm sure there are some of you who buck that trend too, but in all of these scenarios you're relying on the data that was made available to you, and reliance is the key word here. The same is the case in our working environments. We are all dealing with data in some form, and whether you produce the data, consume the data, or maintain the data, having a level of control relevant to the use case is definitely necessary.
In the financial industry there are both internal and external factors that drive this need. Externally, being able to deliver exceptional client service is top of the list, and this includes being able to exchange accurate data efficiently to keep up with moving markets, while also transforming large sets of data in many different ways to meet post-trade obligations such as confirmation and settlement. Another post-trade obligation is meeting the regulatory mandates imposed on firms: over the past 15 years, regulations have been introduced as a governance structure for data at an industry level, and this has meant that each individual firm has its own data governance requirements to ensure that the data being shared is accurate, complete, timely, and traceable back to source. From an internal perspective, data is the baseline for driving business decisions and enabling secure risk and control management. This isn't purely at a monetary level, although clearly that is important to running the revenue divisions of a firm; data is also driving people decisions in HR and environmental-impact decisions in location and office strategy, just to name a few.

Hopefully I've convinced you that data governance is important regardless of your current focus and work scenario. So what do we mean by data governance? I've mentioned a few concepts already, but let's take a look at a couple of the definitions that may come to mind when someone says "data governance."

First, ownership, privacy, and security. Here we're asking: is the data stored in a secure manner, and who has access to it? Another word you might use is entitlements. For example, if we think about our own personal data and the organizations that store it, we want peace of mind that it isn't available to just anybody.

The next one I have mentioned before: regulations have themselves been a form of data governance imposed by regulators across the globe, and consequently each individual firm has to have its own data governance internally to be able to meet those regulations. Over the past few years this has continued to increase, and I'm sure it isn't going away anytime soon.

Then, lineage and contracts. Here we're asking: where does my data come from? What service level agreements are in place for me to access that data? Does the producer of this data even know that I'm using it, and what happens if I or they change something? These are all questions that can be answered and maintained if they are documented clearly and agreed by both parties through data contracts. With the amount of data available today, knowing which version of the data you're using, and whether new versions have been made available, is highly valuable to ensure information isn't going stale. We'll come back to this later in the presentation in the context of industry standards.

For the rest of the presentation we want to focus on data quality. Data quality itself has many facets, but the few you'll hear us coming back to are accuracy, completeness, timeliness, and traceability. We aim to show you how Legend provides executable logic to impart this level of quality on your data, such that you'll be confident in using it.
Thanks so much, Fee, for telling us more about data governance and bringing up the importance of data quality in that context. I'd like to dive a little bit deeper into the subject, specifically how Legend may be able to help increase data quality in each individual organization. But first, let's take a step back.

Let's imagine we work at a firm that is interested in getting data about the vaccination status of its employees as part of its return-to-office strategy. That may sound simple, but in reality getting high quality data that is relevant for your use case and actionable can be quite tricky. So let's look at how a typical scenario to retrieve raw data for a business-driven report might play out. You can see at the top here the kind of report we have in mind: it would list the legal entity, office location, employee names, and vaccination status. But how do we get the data? In our hypothetical firm there are three databases, a firm database, a person database, and a vaccine database, that store the information we are interested in. What you notice here are the fairly cryptic column names in the data sets; someone less familiar with the database schema or the data itself may not know directly which columns are of interest for their particular use case. To actually retrieve the data, we may have to use SQL or another query method to extract it from the different databases, and we may even have to use some Excel magic to merge the different data sets into one coherent report. Long story short, this is likely to be a multi-step, manual, and error-prone process that can only be performed by a quite tech-savvy person.

Bringing this into the context of data quality, you may notice the following problems. It may be tricky to ensure that our data sets are complete, as it's difficult to identify the right databases and columns, as well as to understand how the data is related. It may also be difficult to ensure the data is accurate, as the process of data extraction and curation is quite manual; the business consumer may have to rely solely on the accuracy of the query written by the developer, with little opportunity to validate the data herself. Lastly, there's little transparency about the origins of the data and how it was transformed along the way: if the data is moved or altered, or queries break, we as the business consumer may have no idea what happened.

So, might there be a better way? As you may be guessing, the answer is yes, there is: with Legend. So what is Legend? Legend is free, open source data management software. We work together with FINOS to make the code available on GitHub, so everyone can install Legend on their own premises. FINOS is also hosting a public shared version of Legend on their servers, and we will talk more about the industry collaboration happening on that instance later in the presentation. Legend has been in the works for a number of years at Goldman Sachs, and internally we have about 10,000 daily active users across many different divisions. We developed the platform because we've seen firsthand the struggle with data silos, duplication, and quality as the complexity of data accelerated dramatically. With Legend there's now a free solution on the market that aims at providing efficient and reliable ways to get access to data that is accurate, timely, and safe. The heart of the platform is Legend Studio, a data modeling environment that allows users to define business-friendly concepts, connect disparate data sets, and visualize model data for easier collaboration. Most importantly, Legend allows developers and non-developers to work together on the same platform through its intuitive and flexible interface.
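As a brief aside for the technically minded: once raw stores like our three databases are brought into Legend, they can be declared in its relational store grammar. Below is a minimal, hedged sketch of what that might look like; the table and column names are made up to mimic the cryptic schemas on the slide, not the real ones.

###Relational
// Hypothetical declarations of the three raw stores. The cryptic
// names (LGL_NM, VACC_STS, ...) stand in for the ones on the slide.
Database demo::stores::FirmDB
(
  Table FRM_TBL (FRM_ID INTEGER PRIMARY KEY, LGL_NM VARCHAR(200))
)

Database demo::stores::PersonDB
(
  Table PRSN_TBL
  (
    PRSN_ID INTEGER PRIMARY KEY,
    FRM_ID INTEGER,
    FRST_NM VARCHAR(200),
    LST_NM VARCHAR(200)
  )
)

Database demo::stores::VaccineDB
(
  Table VACC_TBL (PRSN_ID INTEGER PRIMARY KEY, VACC_STS VARCHAR(50))
)

Nothing business-friendly about those names yet, which is exactly the problem the data model layer addresses next.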
So let's take a look at Legend more closely in the context of the use case discussed earlier. Instead of querying raw data, we can build a data model with Legend Studio and query model data. You can see a data model that would work for our use case here on the slide; I'll walk you through it in just a second, but first let's take a step back and define what a data model is. Simply put, a data model allows you to build a better understanding of your data by creating business-friendly concepts and descriptions of your data and defining data relationships. By doing this, data models add a layer of abstraction on top of your raw data to organize it and make it more usable and actionable across a variety of use cases. Concretely, where we had very cryptic column names before, we can now define business-friendly concepts that are relevant and meaningful to us: specifically a Firm class, an Employee class, and an Office Location class. We can further define attributes such as first name and last name on our Employee class to describe the introduced concepts in more detail. We can also specify how these concepts are related to each other: our firm, for example, can have one or many office locations, and either no or many employees. And this entire data model has been created in Legend without writing a single line of code. In the context of data quality that is fairly important, as data can now be described and agreed upon across different teams, independent of their technical knowledge. This tackles the historic divide between business and tech teams, or, put differently, between the consumers and producers of your data. Lastly, you can see here that the relational databases storing the data we are interested in are mapped to our data model. These data stores can be of different types and scattered across the organization; Legend brings all of them together in one coherent data model, and users can then execute queries over model data, making use of the powerful execution engine that Legend has.

Thanks, Beaker. Do you mind if I ask a couple of questions as we go through this next bit?

Yeah, of course.

You've made it really clear why we need data models. I'm just wondering, as a person less familiar with this use case, is there a feature that would make it easier for me to navigate the data model and identify relevant concepts?

Yes, there is. I could help you navigate the data model a little bit better by adding descriptions via tagged values. You can see an example here on the slide, where I added an alias for Firm, namely "organization," to enable you to search key concepts using different terms. Making it easier for people to search across the data model can definitely help reduce data duplication, as it facilitates reusing existing concepts and mappings to data stores. It's especially easy to look for these descriptions in the actual data model code in text mode.

Wait, what's this text mode?

Well, users can easily switch back and forth between the business-friendly user interface and the actual code of the data model in Legend Studio's text mode, and changes made in either view are seamlessly translated. One more thing that makes it easy for developers and non-developers to collaborate on the same platform.
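For those curious what that text mode actually contains, here is a hedged sketch of how the demo model might read in Legend's Pure grammar. The demo:: package, the Naming profile, and the attribute names are our own illustrative choices, not the exact ones from the demo.

// Hypothetical profile supplying the 'alias' tag used below.
Profile demo::Naming
{
  tags: [alias];
}

// The tagged value gives Firm a searchable alias, as on the slide.
Class {demo::Naming.alias = 'organization'} demo::Firm
{
  legalName: String[1];
}

Class demo::Employee
{
  firstName: String[1];
  lastName: String[1];
}

Class demo::OfficeLocation
{
  city: String[1];
}

// Relationships with multiplicities: a firm has one or many office
// locations, and no or many employees.
Association demo::Firm_OfficeLocations
{
  firm: demo::Firm[1];
  officeLocations: demo::OfficeLocation[1..*];
}

Association demo::Firm_Employees
{
  employer: demo::Firm[1];
  employees: demo::Employee[*];
}

In Studio you would normally build all of this in the diagram view without writing the text yourself; the two views stay in sync.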
That's really great. I'd like to come back to the point you made on avoiding data duplication. Is there another feature in the platform that would help with that?

Yes. Do you see the arrow pointing to the Legal Entity class? This adds a hierarchical layer to your data model in which the Firm class inherits all the attributes of the Legal Entity class; in Legend this is called a supertype. It reduces the need to recreate attributes and allows users to leverage existing defined relationships and mappings to data stores.

I see. So just to check my understanding: the additional descriptions on the classes, the attribute or class inheritance, and the ability to map your data model to databases can really reduce data duplication. It allows people to identify authoritative data sources and then either inherit the attributes or map those to their own use case.

Exactly, and all of this may well lead to more accurate and complete data.

That's good. Looking further at your data model, I assume vaccination status is something that's considered quite sensitive. How can we ensure that people know this when they're handling the data?

You're right, that's absolutely something we have to keep in mind. Legend allows you to add labels to your classes and attributes via stereotypes. In this case we can add a "sensitive" label to both the Employee class and the vaccination status attribute. This feature is quite helpful in drawing attention to data that may need specific entitlements; the owner of the data can then make sure it's properly entitled, so that only Legend users who are allowed to can access it.

Great, I love that Legend can help ensure data isn't getting into the wrong hands. On a different note, can Legend help ensure that when I query model data, I get the data in the format I'm interested in?

Yeah, absolutely. There are actually quite a few different ways we can accomplish that. For every attribute in our data model we can specify the data type; for example, if you only expect numeric and date entries, you can indicate this with the respective data type. Equally, you can specify which fields are mandatory versus optional, or how many values you're expecting to retrieve, by defining the cardinality.

Okay, that's good to know. What if I want to restrict the actual values that are returned to me in my query?

If you know the data quite well, you can specify valid entries via enumerations in Legend. For example, I predefined the valid entries for my vaccination status, namely fully vaccinated, not vaccinated, and first shot only.

That really helps making sure that the data consumed is accurate.
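Taking those answers together, here is how the supertype, the sensitive stereotype, the typed attributes, and the enumeration might look if we extend the earlier sketch in text mode. As before, the names are illustrative.

// Hypothetical profile supplying the 'sensitive' stereotype.
Profile demo::DataProtection
{
  stereotypes: [sensitive];
}

// Enumeration restricting vaccination status to three valid entries.
Enum demo::VaccinationStatus
{
  FULLY_VACCINATED,
  NOT_VACCINATED,
  FIRST_SHOT_ONLY
}

// Supertype: Firm now inherits legalName rather than redefining it.
Class demo::LegalEntity
{
  legalName: String[1];
}

Class demo::Firm extends demo::LegalEntity
{
}

// Stereotypes flag the class and the attribute as sensitive;
// data types and multiplicities pin down format and optionality.
Class <<demo::DataProtection.sensitive>> demo::Employee
{
  firstName: String[1];                // mandatory, exactly one value
  lastName: String[1];
  vaccinationDate: StrictDate[0..1];   // optional, must be a date
  <<demo::DataProtection.sensitive>> vaccinationStatus: demo::VaccinationStatus[1];
}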
These were all great questions, thanks so much. I'd like to continue. We've touched upon this already quite a bit, but I would like to spend a few minutes on data accessibility through Legend. An important factor in data quality is making sure that data can be safely accessed and consumed by the end consumers. Ideally, the data returned should also be easily understood by the consumer, and even better, data consumers should be empowered to build data queries all by themselves, without the need to know any coding language; that way they can make sure they really get the data they want. All of this is possible through Legend: data consumers can use business-friendly terms from the data model and create queries drag-and-drop style, and that's exactly what I've done to create the report we wanted to see at the beginning, right here in Legend. What I did was basically just drag the business concepts that were of interest to me, such as office location, into the execution panel, and then all I needed to do was hit play and get my data. For enhanced transparency, Legend also makes it easy for consumers to understand how the physical data sources map to the data model attributes; you can see the mapping details here on the slide. Making it clear to consumers of your data where the information is coming from is a key ingredient of high quality data. Lastly, I'd like to mention that Legend also allows for programmatic consumption of model data, not only ad hoc queries. Users can create APIs with a simple click of a button and consume them via executable Java files in their Java applications, and we are also working on making consumption via REST APIs possible. Hence, high quality model data can be used systematically in any production process.

This is so helpful. I have one more question, though. What if I'm interested in a slightly different query? Say I'm only interested in a status of vaccinated or not vaccinated; that might be important if governments start to mandate that only fully vaccinated individuals are allowed back into offices, for example. Do I have to create my own data model for this?

No, you don't. You can build your own transformation from the model that I built to the one that you want to see. We'd do a model-to-model mapping in this case, mapping the enumeration values I specified to the ones you have in mind. And in the context of data quality, you'd still see the original shape of the data and how it got transformed along the way, creating lineage for your data. You can see the model-to-model mapping and the slightly different query results here on the slide. An interesting point to bring up here is that accuracy of the data is very subjective to the individual use case: your expectation of accurate data is different from the one I had originally, and Legend provides the flexibility for end users to bring the data into the shape and quality they want to see.

That's so true. And I'm just thinking, perhaps the government only puts this mandate on firms that have more than 100 employees. Is that something I could also specify in my model?

Yes, absolutely. What you can do is add a constraint on your Firm class, which adds a validation rule to your data model. If you execute your query, Legend will return a defect if the firm does not have at least 100 employees. These constraints can be set on both the source class and the target class, which enables interesting and complex cross-divisional consumer and producer constraints.

Thanks so much, Beaker, for taking us through that. In an attempt to summarize all that you've just seen: Legend as a toolkit offers an interactive way to instill data quality as you build data models and query data sets. The main aspects of data quality that we've shown through the features of Legend are completeness, with the likes of cardinality and enumerations; accuracy, through data types and constraints; timeliness, by reducing duplication through clear definitions and tagged values, as well as class and attribute inheritance; and finally traceability, looking at mappings, whether relational mappings to data sets or producer-to-consumer model-to-model mappings. And all of this is encompassed by being able to deliver privacy on sensitive data.
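To round off the walkthrough, one final hedged sketch of the remaining pieces: the headcount constraint, the kind of Pure query the drag-and-drop builder generates behind the scenes, and a relational mapping tying the model back to the hypothetical store from earlier. The grammar follows the published Legend examples, but treat the details as indicative rather than exact.

###Pure
// Firm again, now with the constraint added: querying a firm with
// fewer than 100 employees surfaces a defect for this rule.
Class demo::Firm extends demo::LegalEntity
[
  minimumHeadcount: $this.employees->size() >= 100
]
{
}

// The report query: filter on the enumeration, then project the
// columns from the report shown at the start of the talk.
function demo::vaccinatedEmployeesReport(): meta::pure::tds::TabularDataSet[1]
{
  demo::Employee.all()
    ->filter(e | $e.vaccinationStatus == demo::VaccinationStatus.FULLY_VACCINATED)
    ->project(
      [e | $e.employer.legalName, e | $e.firstName, e | $e.lastName],
      ['Legal Entity', 'First Name', 'Last Name']
    )
}

###Mapping
// One class-to-table mapping, making the physical-to-model lineage
// explicit for consumers of the model.
Mapping demo::FirmMapping
(
  demo::Firm: Relational
  {
    ~mainTable [demo::stores::FirmDB]FRM_TBL
    legalName: [demo::stores::FirmDB]FRM_TBL.LGL_NM
  }
)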
All of these concepts you may be facing yourself at the moment. If you're a producer of data, how can you help your consumers by building logical, easy-to-understand data models on top of your data sets? As a consumer, do you want to be able to start building a model to your specific requirements? Or as an engineer, are you currently feeling stuck in the middle, and could you help connect your producers and consumers directly using this tool? All of those things we have been able to do using Legend at our firm, so it's definitely possible, regardless of the data problems you're facing, for you to also install Legend on premise.

But Legend isn't only available for you to download; it's also available on FINOS as a shared instance, and it wouldn't be a FINOS session without us mentioning collaboration and the industry working together. So, to close this session, I want to briefly look at how Legend is helping maintain data quality through the industry efforts to build data standards and, in turn, improve the operating environment for us all. The financial industry is complex, and the data available on financial products alone is huge, so the industry has been collaborating for years in different forums to improve the exchange of data and the interoperability of industry participants. In any multi-person setting, change management of data, or versioning, must be maintained to ensure that data isn't lost and that we avoid duplicating effort by building upon others' previous work. It also makes it easier for new participants to enter and contribute, which is a key tenet of open source. In the case of the Financial Objects SIG, and in building industry standards such as the ISDA CDM, Legend is being used in exactly this way. All data models built in Legend are merged on GitLab, which provides a clear history of the contributor and their contribution. Through simple steps on the Legend GUI you can test your model and code, submit them for review, and merge them into the working version. This ensures that all participants are viewing the latest working version of a model, which drives consensus. Additionally, the industry bodies that facilitate this collaboration are key to providing oversight and getting the broadest possible participation in the industry standards, so that we can ensure that when they're created they are reliable and usable in improving the exchange of large data sets in a safe manner. By open sourcing Legend we are keen to help and to see more industry collaboration taking place; working together with our peers on industry standards to solve the data problems felt today is key for us to see the financial industry progress towards the technical innovations available to us all.

Thank you for spending this time with us today. If you're interested in hearing more on how to share Legend with your organizations, please find me around or email either of us at any time. There are a number of other resources available to you, so please take a look at the Legend in Action videos on YouTube or have a look at legend.finos.org. Thank you.

Yeah, thanks everyone for joining, and hopefully that was helpful. As I said, for those of you who joined us a bit late, my co-presenter is in New York and she couldn't make it over this time around, so we had to record it; I apologize a little that it had to be delivered in that way, but I am here in person and I will be here all day. I'm not sure what we're like for time, I feel like we could be cutting it fine, but if there are a couple of questions I'm happy to take them now; otherwise, obviously, do come and find me at any point.
Yeah, go for it.

So, from a data catalog perspective, there is actually another part of Legend. At our firm we use a slightly different component that's not yet open sourced, and I actually don't know if that is in the plan, so I will find out for you and confirm. I would say, if you wanted to just document a list of data attributes, you absolutely could, and use Legend in its most basic form in that sense, but there are other aspects, so I'll come back to you and check on that.

So I wouldn't necessarily say "replace"; it is obviously similar to that. In terms of how we use this in the firm, it covers all of our divisions. We have spoken to a couple of clients as well who are currently connected to Collibra, and I think they are looking at being able to use the two together as their full solution. So definitely not a replacement, but something that can be used together with it. Obviously at our firm this is what we use, and there are a couple of other aspects we have internally that allow us to do that completely, and some of those will be open sourced in time, but it can definitely be used in conjunction, as we've heard from speaking with some of our clients as well.

Yeah, so at the moment the external version, and I think I'm right in saying this, can connect to H2 databases. There is a plan, and I'm not sure if it's been announced yet, to connect it to some other external data sets and data providers, so that sort of functionality is being developed, but today, externally, it can be connected to H2 databases. Internally we use it across quite a few; I'm looking at Dave, he might be able to help.

So actually, my team within the firm do exactly that, specifically for our transactional data. Coming back to that idea of mapping: the authoritative data source model is something that my team owns, so we will certify every attribute that's part of it, and anyone who uses the data then has to build a model back to that, and we have to be part of that process to make sure they're not using something they shouldn't be. To the extent that they do, it's therefore not certified, if that makes sense; they can go and do that, but there's no one standing behind it. If they come and model-to-model map into our enterprise models using this, then we as a team stand behind that data.

For the next one: that is a little bit beyond my technical side of things, but catch me afterwards and I'll chat, and I'll go back and find out for you.

Yes, I think from the industry perspective, which is the bit I've been a little bit closer to, the big thing has been the need for the SDLC piece, the bit I mentioned about the reviews. At the beginning, when we open sourced, that actually wasn't possible on the UI, so you would submit and then still have to go into GitLab to do the rest of the process. That was something we made available on the GUI very quickly, because even for me, coming from the business side of things, having to go into GitLab is a little bit alien. I'm getting there, I'm getting used to it, but being able to do it directly from the GUI means we can get those business people engaged in the industry standards very quickly, without assuming they then have to get onto GitLab. They obviously have to have an account, but they can do it all from the GUI, so that was something we wanted to get over the line really quickly, to make sure it helped from an industry standards perspective.
On people downloading it on premise: when we've been working with clients, and I know there are a couple of other banks that have been doing that as well, it's just been really interesting to see how they have integrated it. I guess every bank is different and has a slightly different way of doing things, so it's been interesting to hear where their difficulties have come up, which have just been different to ours, and to be able to improve the system in that sense.

What do you mean by that? Yeah, so within Legend itself, before you can send code changes for review, they go through certain tests. To the extent that your data model has been connected to your relational database and any data is changing there, you would get errors in Legend directly. So if I were a consumer running my mapping, it would tell me that the data had changed and was no longer meeting my requirements. It's not necessarily going to pick up every change; it would pick up the changes that impact me.

Let me just check how we're doing for time... okay, sure. Dave, do you want to take that one from a technical standpoint?

Great. Well, thanks everybody for joining us. If you have any further questions, and for the two that I need to follow up on, come and grab me; I'll make some notes, take them back, and come back to you. Otherwise, I hope to see you all around, and I'm happy to connect at any point. Thank you.