Good morning, everybody, and hello to everyone watching online. Thank you so much for being here. My name is Caroline Cowart. I am the group supervisor of the library at NASA's Jet Propulsion Laboratory, right here in beautiful Pasadena. We are one of ten NASA centers, and today I'm going to talk to you about open science: what open science is, and how the JPL library is not only involved in open science but is leading open science on lab, within NASA, and around the world. It's a really exciting time to be in libraries and in information science. Anybody who says libraries are dying obviously hasn't been in one for the past 35 years.

Here's a short summary of what I'm going to talk about; what librarian doesn't like a good table of contents? I'm going to talk about why open science: what's the point, what's the big deal. Then open and FAIR initiatives: what's the difference between open science and FAIR data? Many of you are familiar with the acronym FAIR, and we're going to take a bit of a deep dive into it. We're going to talk about the data science activities and partnerships of the JPL library, and how librarians are heavily into data science as a sister discipline to information science. We're going to focus on open access, which is one element of open science. Libraries have been involved in open access for decades, and I'm going to talk about what it is, why it's important to you, and why you need to know about it. I'm also going to talk about our partnership with our chief knowledge officer. Knowledge management and knowledge transfer is a huge initiative at NASA, one that's been developing for a couple of decades, and I have a wonderful working partnership with our chief knowledge officer at JPL. Knowledge management is tangential to open science, but it's also related, which is why I'm mentioning it. I'm going to talk about my involvement with the AGU data citation community of practice, an initiative I started as a pandemic project with the American Geophysical Union that has really taken off, and why it's important to all of you. And finally, NASA's ethical AI initiative.
I'm the only librarian I know of involved in ethical and responsible AI, at least in NASA. I see it as a direct adjunct to information ethics, so I'm going to talk a little bit about the relationship between those two. I'm also going to talk about the open science session at the American Geophysical Union Fall 2021 meeting. All of this feeds into libraries and open science and why it's important.

Let me talk a little bit about JPL, if you're not familiar with where I work. NASA's Jet Propulsion Laboratory started in 1936 as essentially a chemistry and rocket propulsion laboratory on the campus of Caltech, here in Pasadena, led by Professor Theodore von Karman and a group of students, fans, enthusiasts, and hangers-on. Back in the 1930s, rockets were so new they were still considered more or less science fiction: Buck Rogers flying around in rockets, Ming the Merciless and all that. It wasn't a clandestine operation, but it wasn't the humanities or chemistry or mathematics, which were really heavily celebrated in the 1920s and 30s. Legend has it, and this is just lore, so take it with a grain of salt, that the chemistry and rocket propulsion lab blew up a classroom on campus. Actually, I think it was a hallway, or maybe a stairwell. They blew up something big. And if you've ever been to Caltech, they have this beautiful Italianate, neo-Renaissance architecture, just gorgeous, and the administration at Caltech said, Dr. von Karman, we love what you're doing, please go do it someplace else. No more blowing up classrooms or hallways or staircases. So the group found this sort of empty wasteland called the Arroyo. There was nothing there; it was a terraced hillside, and they figured, we can blow stuff up all we want and we're good, right? That's how JPL wound up where it is, just north of the Rose Bowl off the 210 Freeway. If any of you have ever been to JPL during our open house or on a tour, you know we're on a very steep hillside, and that's why: so we won't hurt anything.

Through the years the lab has really grown; we've added multiple buildings. During the Second World War, the Department of the Army took notice of what Dr. von Karman and his team were doing regarding jet engines. Remember, going into World War II we had prop planes, propeller planes, and the U.S. Army thought, ah, this is a great thing for the war effort: we can attach these things to our airplanes and make them much faster and much more maneuverable. So JPL actually became part of the Army effort from 1943 to 1957. And in 1958 there was a momentous event all over the news: something went into space that the U.S. got very jealous of. It was called Sputnik. The USSR beat us into low-Earth orbit. So NACA and the very nascent NASA agency again saw the potential in jets and thought, okay, if we just orient them vertically instead of horizontally, we've got a cheap, easy, consistent way of getting out of Earth's atmosphere. So JPL became part of NASA in 1958. We are NASA's only FFRDC. What's an FFRDC? It's a federally funded research and development center. FFRDCs are all over the federal government, lots of partnerships with different federal agencies, usually with either a private corporation or an educational institution such as Caltech. So if you see the Venn diagram, that's how we relate to Caltech.
We are still a, essentially, rocketry and propulsion lab of Caltech, but NASA funds us, gives us our marching orders, if you will. Caltech provides the staff, NASA pays for the buildings. It's a real symbiotic relationship, as governed by something called the Prime Contract. So that is NASA and JPL in a nutshell. This is me. I wasn't always a librarian. I was not born a librarian. I got my Master of Library and Information Science from San Jose State University in 1997. I was a child. My undergraduate degree is in music, cello performance. And I still play the cello. I'm in several symphony orchestras around town. Whittier, La Morata, Bakersfield, South Bay, you name it. It's sort of my right brain escape. But about halfway through my schooling at Cal State Long Beach, which is where I went, I realized I had an epiphany. That music, like crime, does not pay. And I had to get a job. I wasn't going to be the next member of the LA Philharmonic, because I really didn't like to practice. I'd much rather be going to parties and hanging out with my friends. So I got a job in my local public library. I just happened to live down the street and it clicked. I found I had a real affinity and an aptitude for connecting people with the information that they need. And I was a library assistant without the degree for five years. Decided to go to library school. And 30 years later, it's a career, right? So I worked for all types of libraries. I spent the majority of my time in higher education at Cal State Dominguez Hills in Carson, which is about 30 miles south of here. I've also worked at University of Southern California, Pasadena City College, the Sierra Madre Public Library, and the 20th Century Fox Studios Library. I became tenured faculty at Dominguez Hills back in 2009. And coordinating information literacy initiatives, high impact practices, other types of initiatives, left it all behind to become the head of the library at JPL in 2017. So I just celebrated my fifth anniversary last July. And I'm going to go into six years at JPL, hopefully more. I'm having a ball. I get to work with very, very smart people every day. They challenge me. I have to bring my A game all the time, and it has raised my A game. It's been a really wonderful ride. Okay, so why are we all here? Open Science. Why Open Science? What's the big deal? What's the point? So Open Science, in a nutshell, is making the product and process of federally funded research globally available. The catch phrase is, as open as possible, as restricted as necessary. We're not talking about stripping away all security measures, but we want to make sure that federally funded, taxpayer-funded science is available to everybody who wants it, who needs it, who can use it, or who is just interested in it. So several initiatives have been brewing over the past several years. The White House, OSTP, Office of Science and Technology Policy, came out with a memo several months ago, essentially outlining the guidelines and rules and policies around open science, lifting restrictions, eliminating roadblocks, making sure that you and me as taxpayers can access these materials, the data sets, the software, the journal articles, et cetera, everything that is paid for through NASA and other government agencies. Oh, by the way, I have, excuse me, I uploaded my slides to the scale website, and all these links are live. These are everything in red, so you just can go later next weekend, you know, two o'clock in the morning. 
Just start clicking on links, and everything is live. So the SMD, Open Source Science Initiative, SMD stands for Science Mission Directorate. So this is the agent or the office in NASA that coordinates, funds, organizes everything that flies, everything that all the Earth science satellites, heliophysics, astrophysics, everything, any kind of satellite, any kind of rocket, any kind of aviation research, the Science Mission Directorate coordinates all of that. They have come out with their Open Source Science Initiative, and their tenets are transparent, accessible, inclusive, and reproducible, and their statement on their website, Open Source Science requires a culture shift to be more inclusive, transparent, collaborative, et cetera, which will increase the pace and quality of scientific progress. So what they're saying is opening up all of this information really benefits NASA, benefits researchers, benefits the entire world. Their community building effort, their wing of sort of outreach, training, communication is called TOPS, Transform to Open Science, and I've been involved in the TOPS Initiative for about a year and a half. Among other things, they're developing a certificate and credentialing program for Open Science, for university researchers primarily, but for any researcher who is interested in applying for NASA-funded projects, we have a threshold, we have some rules that now ask you to demonstrate how you do Open Science or how you plan to do Open Science with NASA's money. Not sure how to do that? Check out TOPS. They have a training program that walks you through from theory to practice to application. It's a really wonderful program. Also, I mentioned the American Geophysical Union before, which is the world's largest Earth and Space Science professional association. I got involved in AGU several years ago when I started at JPL because, to be perfectly honest, after 30 years in libraries, I'm kind of sick of library conferences going to talk to people just like me about issues that I'm intimately familiar with and I figured, you know, I'm brand new at this space thing. I really need to get to know my users, my library patrons and what better way than to start going to space conferences. So I became very, very involved with AGU. They have a wonderful data track, also an amazing DEIA track. If you're at all interested in equity, inclusion, diversity, check out AGU. They're really doing it right. They have a statement on open science, promoting unfettered communication of data debates, findings, et cetera. So open science really is a global effort and it's growing by leaps and bounds. There's funding available for it. There's training. There's all kinds of information. So it really behooves researchers around the world to at least check it out if not take a deep dive into open science. NASA also declared this year, 2023, the year of open science. And again, these are all live links. I'm not going to read these to you. But for a variety of reasons, we're really focusing on open science. This year there was a big run up. Last year a number of other federal agencies are also partnering with NASA in the year of open science. We want to do this as a team across the federal government so the messaging is clear and consistent, so the effort is clear and consistent, and so we're all kind of putting out the same information and sort of partnering with each other so that no one agency is far out ahead of anyone else. 
At the very bottom, those two links, SPD-41 and SPD-41a, are newly minted NASA policy: the scientific information policy directives. SPD-41 and 41a cover specific things like how and why to release your data sets, how embargo periods on journal articles are lifted, and what types of software, code, and tools need to be released along with your science and your journal articles. They really get down into the specifics of what researchers must do in order to be eligible for NASA research funds. So I encourage you to check those out; if you're into reading government policy, those last two links are for you.

Open and FAIR. A lot of people conflate open science with FAIR data. What's the difference? Don't they do the same thing? They're bookends, but they are not two sides of the same coin; there are some concrete differences. Open science, as I mentioned, is about making the process and product of federally funded scientific research available. That includes open access, which is unfettered access to journal articles: instead of clicking on Springer or Wiley and having to pay 50 bucks a pop for an article, open access means the PDF just opens. There it is, there's the article; you can download it, you can share it. Open data is the open sharing of your data sets on a public repository. This was a challenge for JPL, because with more and more emphasis on multidisciplinary and interdisciplinary science, especially from the various decadal surveys, it became difficult for JPL researchers to place their data sets in domain- or discipline-focused repositories. They were getting turned away because they were told, you're not earth-science-y enough, or you're not astrophysics-y enough, because their data were inherently multidisciplinary. So at JPL we figured, why not make our own? We're just finishing a two-and-a-half-year process of creating our own in-house data repository based on the Dataverse open-source platform from Harvard University, and we're very excited about that. Open code and open tools are a similar idea: posting them publicly so everyone can see and download your software, run the data sets, and replicate the results of your research.

FAIR data is a little bit different. FAIR stands for findable, accessible, interoperable, and reusable. What does all that mean? Findable: can you find the data set using a standard search engine or an in-house search engine? Accessible: once you find it, can you even see it, can you download the data? Interoperable: can you share the data across platforms and across users, or is it stuck in that one repository? Reusable: can you repurpose that data set, possibly for your own uses? Not all data is FAIR, but there is a push to make data FAIR. So what's the difference between FAIR and open? Just because you have a FAIR data set doesn't mean it's open to the entire world. You can have FAIR data inside an organizational firewall that is findable, reusable, and interoperable within your own organization but is not open. Conversely, you can have an open data set that's available around the world but that is neither reusable nor interoperable nor really accessible.
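To make that distinction a little more concrete, here is a small illustrative sketch; the field names, URLs, and dataset are invented for the example and don't describe any real JPL record.

```python
# Hypothetical illustration of FAIR-style metadata vs. a merely "open" posting.
# All field names and values are made up for the example.

fair_record = {
    "identifier": "doi:10.5072/example-1234",       # Findable: persistent identifier (DOI)
    "title": "Surface thermal emission test dataset",
    "keywords": ["thermal emission", "planetary surfaces"],
    "access_url": "https://repository.example.org/api/datasets/1234/download",  # Accessible: standard protocol
    "format": "CSV",                                  # Interoperable: open, non-proprietary format
    "schema": "https://repository.example.org/schemas/thermal-v1.json",
    "license": "CC-BY-4.0",                           # Reusable: explicit license...
    "provenance": "Derived from instrument X, calibration pipeline v2.1",  # ...and provenance
}

open_but_not_fair = {
    # Publicly posted, so technically "open"...
    "title": "misc_results_final_v3",                 # ...but hard to find (no identifier, no keywords),
    "access_url": "https://example.org/~user/data.zip",  # hard to access programmatically (ad hoc link),
    "format": "proprietary binary",                   # hard to interoperate with,
    "license": None,                                  # and legally unclear to reuse.
}
```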
We ran into exactly this when we were looking at the planetary data ecosystem, PDS+. The world of planetary data was, yes, open, but it was very difficult to get into because the repositories were designed like labyrinths. There were dead links, there was the email-the-repository-owner-for-permission step, so yes, the data was out there, but it was very difficult to find. So FAIR can be locked down in-house or open, and open can be FAIR or not FAIR; there are definitely differences.

I talked a little bit about how to find these things. The JPL library is involved in search curation initiatives to make sure our in-house data sets are findable by JPLers. We're heavily involved in data governance and data stewardship at JPL, and also in the semantic tools of information science: taxonomy development, knowledge graphs, and ontology and metadata development. So the JPL library is really heavily involved in information science as well as library science. It's an exciting time to be a librarian.

Leading into some data science partnerships: when I started at JPL, I was handed the chairship of the ontology working group, and having come from 20 years in higher education leading information literacy initiatives, I didn't know what an ontology was. I'd never heard the word before; don't tell my boss. But I learned very quickly what it was, I was able to scaffold my learning and transfer the skills and techniques I had as an information scientist over to the data science realm, and I became a much more effective chair of the ontology working group. A couple of years later I rebranded the group as the semantic technology community of practice, because we were doing far more than just building ontologies: we were slipping into knowledge graphs, taxonomy development, metadata, all kinds of things. So that is one partnership we have with our data scientists at JPL.

It also includes data governance. Like I said, we're heavily involved in data governance, and we mint DOIs. I talked about the new Dataverse platform at JPL: bring the JPL library your data set and we'll mint you a DOI, so you have a persistent identifier for that data set. We've been doing that for about a year and a half. Data placement: as I mentioned, we were having difficulty working with Caltech, NASA, and the different discipline repositories, so we decided to build our own, but my librarians are expert at working with data sets, loading them, and communicating with data repository owners. We also started a JPL-wide data management plan template, and we have various taxonomy initiatives; taxonomy building is a service we offer through the JPL library, and I hired a taxonomy librarian to develop that service specifically.

And I mentioned the JPL Open Repository, our Dataverse instance; its acronym is JOR. We have to have a TLA for everything, and if you don't know what a TLA is, ask me after the talk. So, a little bit of a deep dive on the JPL Open Repository. Here's the URL. There is a public side and an internal side, so we host both public and internal data sets, material, and content. This replaces the old JPL TRS, the technical report server; we ported all of the documents from the TRS into Dataverse, and this is what we have now: a FAIR environment and foundation.
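Since the repository is built on Dataverse, here is a minimal, hedged sketch of what querying a Dataverse-based repository programmatically can look like, using Dataverse's native search API; the base URL and query are placeholders, and details such as authentication and available result fields vary by installation.

```python
# Minimal sketch: querying a Dataverse-based repository's search API.
# The base URL and query are placeholders; a non-public instance would also need
# an API token passed in the "X-Dataverse-key" header.
import requests

BASE = "https://dataverse.example.org"   # hypothetical repository URL

resp = requests.get(
    f"{BASE}/api/search",
    params={"q": "thermal emission", "type": "dataset", "per_page": 10},
    # headers={"X-Dataverse-key": "YOUR-API-TOKEN"},  # needed for internal/private instances
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # Each dataset result carries its persistent identifier (DOI) alongside its title.
    print(item.get("global_id"), "-", item.get("name"))
```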
We're doing open source, open tools, open access for all JPL authored papers and a little bit about the history, why we went into this and some of the features of the JPL open repository mentioned data storage. Dataverse is a wonderful platform. It does auto-abstracting. So if you have a paper and you think, okay, we need to OCR this and get an abstract, Dataverse will do that for you. It's a wonderful tool and also auto issues data DOIs and it has wonderful search capabilities so you can bundle a journal article with a data set with the software that goes with the data in its own folder. So you do a text search, a natural language search and it'll send you a folder of, hey, here's an article and here's everything else that goes with it so you don't have to go hunting around the repository for all the materials. It's a great tool. I'm very, very happy about it. Open access. Okay, so what is open access? Librarians have been involved with the open access piece of open science for 20 years. So as soon as the majority of scholarly peer-reviewed journal articles became digitized and the majority of them available online as opposed to a paper journal, publishers began taking these behind paywalls. You had to either pay by the article or pay for an ongoing subscription to access journal articles. Most university libraries encumber that cost for either the pay by the drink or journal subscriptions so that if you're affiliated with a university you can just click on the link and you have immediate access to that journal article. But what was happening is that subscription costs were going higher and higher and higher and university libraries were having to cancel a lot of their subscriptions, making it much more difficult for researchers on campus to access those articles. And so open access was, well, is a solution to that. What open access is, they remove the paywall, the publisher removes the paywall, they remove the roadblock to reading the article. So if you are a member of the general public and you are looking for an open access article, if you find one, the PDF link will just be live and you will not be presented with the option to either pay $60 for the article or contact your campus librarian because if you're not affiliated with a university, you think, I don't have a campus librarian and I really don't want to pay $60 for one article because if you're doing a literature review and you need 15, 20, 35 articles, that adds up to real money after a certain point. So open access is the ability to read articles without payment, without paywall. The problem is the publishers raised the publication charges. Researchers have to pay to get their stuff published, to get their articles published. They have to pay a fee to journal publishers, sometimes $1,500, $2,000, it can get very, very expensive. And if you want your article published as open access, that fee goes even higher. So there is, it's a double-edged sword, it's great for the global researching public, makes it more difficult for individual researchers because that money has to come out of your lab or out of your grant, how you're publishing. I could speak for hours on open access and journal publishing. Transformational agreements turned that kind of on its ear and built in these journal, these publication charges into paying for the subscription costs. 
So instead of paying publication charges by the drink, publication by publication, the library or the institution pays one large bill at the beginning of the year to cover all publication charges. The JPL library is pursuing joint subscription and joint access contracts with Caltech; we're knocking down those dominoes one by one. In the past our collections were separate because they came out of different pots of money: the JPL collections come out of NASA money, the Caltech collections come out of Caltech money. But we're finding ways to partner with Caltech so that everybody, both Caltech and JPL, has access to the same content. We also have a partnership on lab with our document review group. What document review does, essentially, is make sure we're not releasing secrets to governments and people we don't want to see them. They run a clearance process: every time we publish a journal article, document review has to look at the content, the technology, the diagrams, the information going out to the general public. I had to get this talk cleared through document review so I could talk to you all today. We also have an Open Access Week event with the Caltech library that usually draws a lot of people from both institutions. So that's what we're doing with open access.

I'm going to breeze through the next couple of slides. We partner with our chief knowledge officer on knowledge transfer and knowledge management. There are differences between what information scientists do and what knowledge managers do, but we have a very fruitful partnership with them and we want to continue to grow it. As I mentioned at the outset, knowledge management is a bit tangential to open science, but it is definitely a growing part of it.

We started the data citation community of practice in January of 2021. What frustration is this solving? We were finding that citations to data sets in the professional literature were very difficult to find; they were buried, and people were calling the data sets different things when they published their articles. This was a vexation to data managers and repository owners, who have to report each citation to their stakeholders to prove the value of their repositories. So we started this with a group of researchers, publishers, and librarians from in and around AGU, and it's still going today, trying to apply some kind of machine learning solution to finding and teasing out citations to data sets within the professional literature.

I mentioned ethical AI at the beginning. This is a real passion of mine, because I see a solid through line from ethical and responsible AI to information ethics. As a librarian, my acculturation and schooling are grounded in information ethics: patron privacy, access to information, quality of the collection, and so on. And because almost all of the information we access and upload goes through some kind of artificial intelligence filter, if it isn't actually provided by artificial intelligence (hello, ChatGPT), it is really incumbent upon librarians and information professionals everywhere to take a look at ethical AI and responsible AI. I've been on this road for about a year and a half and have given a couple of conference papers, one at COSPAR, the Committee on Space Research, last summer in Athens, Greece.
I'm very involved in the NASA effort and the AGU effort for ethical AI and responsible AI. If you want to know more about the difference between the two, I'm happy to talk with people after the talk about responsible and ethical AI. It's worth checking out. I briefly mentioned the NASA PDE IRB Planetary Data Ecosystem Review Board. As I mentioned before, our planetary data was in several repositories, was kind of in different formats, very difficult to access and so we tried to get our arms around this with a report to NASA from a couple of years ago talking about how to better present our planetary data to the general public, how to make it easier to find, easier to access, easier to download, more fair, compliant, but it's not just planetary data, it's heliophysics data, it's earth science data, it's astrophysics data, it's all the divisions. So you can check out the final paper. Again, the links are live, but I'm very, very proud of the work that we did on this committee because I think it starts the conversation about how to structure open access and open science data so that the general public can access it. And then finally, we ran an open science panel or actually session at the American Geophysical Union Fall Meeting in 2021 called Yes, We're Open, The Benefits of Fair Data in Open Science. We had a couple of spectacular keynotes. The former chief scientist of NASA, Dr. Jim Green, gave our open science keynote and the chief executive officer of NISO, National Information Standards Organization, Todd Carpenter, gave our Fair Data keynote. Couldn't be prouder of this whole effort and that really got the ball rolling and the conversation started across the broader AGU about open science. They've been involved in open science for a while, but we had 14 speakers, had very young publishers, diamond open access publishers, happy to talk about the different flavors of open access with anybody here. So wrapping this up, why open science? Why are we doing all this? What's the point? At the end of the day, I have a couple of maybe self-serving reasons to do open science, but it really is all about broadening our reach, making federally funded material and science and the process and product of what we do global globally. And for me, this serves a couple of purposes. It makes current research more robust. It enhances the conversation. It drives science forward, but it also has the potential to inspire young researchers, high school kids, junior high kids who have never seen a data set before. They want to experiment with some open source tools or software and they want access to NASA author journal articles. You know, you can be an inquisitive 14-year-old, you can be an inquisitive college student if you're a K through 12 teacher who wants to use this data for a lesson plan. There's all kinds of reasons why open science is a really good thing. Another, I would say, a benefit of open science, it opens up the research process and the research is a product to underserved underrepresented populations, researchers from the global south, researchers from countries where they don't have a space agency or their science funding federally is not as well developed. This allows them to access everything that we're doing at NASA, everything we're doing in several federal agencies for no cost. It's just open. And so, again, it really fosters the science around the world and I'm very, very, I'm heartened to see where the next 10 years goes because of open science efforts. 
It also expands the scope of libraries. Again, this is a little self-serving. We're getting into data science, we're getting into knowledge management and computer science and being recognized and it really breaks down barriers and raises the profile for not just the JPL library but for engineering and science libraries and university libraries around the world. And at the end of the day the very last bullet is what it's all about. We are able to serve our library users much more effectively. That's my talk. Thank you very much. Please keep in touch and I'll take questions. We have time for questions if anybody would like to ask. We got a mic going around. Great. Thank you for your talk. I really enjoyed it. As a biologist who works with people at NASA who are mostly physical science folks, one of the issues that I have working with them is they often use MATLAB. This may be a question that you really don't care about but they often will put their data out in MATLAB and make that is not necessarily compatible with Octave which is sort of the free and open source version of MATLAB. Do you guys actually curate and check that code? Is that something that you guys do? So that's an ongoing conversation right now. Is the code dataset who is responsible especially in the clearance process who checks the accuracy who checks the veracity where it's coming from through the chain of custody if you will. And what we have determined at JPL is really that's up to the line management. The researcher that produces that code who checks the quality it's up to that person's supervisor the person that is really most intimately involved with that researcher's work so we're really leaving that up to the line management. We don't have any kind of external checking because a lot of the work we do is rather niche. It's pretty in the weeds and it would be very very difficult and really not super effective for us to find a group of people that go around checking datasets or that go around checking MATLAB code because only you know your research intimately and hopefully you're having a conversation with your boss about what you're doing and your funding and your time and all that. So at JPL we're leaving that to the line management. You said we should ask you after the lecture about Fair AI or Ethical AI Ethical and Responsible Artificial Intelligence Absolutely. So again this is something I'm very very passionate about. So Ethical AI came about the conversation started about five or six years ago I think because well we've all experienced ethical AI right? Attempts to gather our own data we're all being monetized, big brother, etc. And as an information scientist and as a librarian I am very passionate about information ethics making sure that the people that I work with who I'm shepherding through the information process have access to the highest quality the best information possible and if there is any kind of hitch if there's any kind of problem or issue I need to know about it as well as the library user so I got involved in Ethical AI from that standpoint from information ethics. So there is a difference between Ethical and Responsible AI? Pardon? 
Oh, examples of So a classic example is an artificial intelligence system loaded into let's say the healthcare or the medical care industry that is separating out patients for diagnosis and treatment that has either an unintentional glitch in the code or an intentional filter in the code that is making certain types of people with certain types of diseases or medical issues either blocking them from treatment or making their treatment more expensive or shuttling them off to other hospitals other medical clinics so they can't get timely treatment that's a thing that's happening is that providers of care, medical care and healthcare especially in a profit driven environment I don't want to get too political or preachy here but in a profit driven environment they want to expedite and raise profits and make sure that they're getting the most bang for their buck so there are some inherent biases in artificial intelligence systems now ethical AI is all about yes after the fact ferreting out those biases correcting those biases but also being proactive going in and saying okay when we develop this system we're going to be as unbiased as possible but also this is a call to surgery because quite often we purchase these black box systems and industry they won't tell us how they developed their algorithm well great example is Google how does Google search work they're not going to tell us librarians have been clamoring for that information for 30 years and they especially Google scholar Google scholar is a bit of a conundrum because how does Google determine what article or what content is scholarly well Google has their own algorithm okay what is it they're not going to tell us so it's that kind of black box mentality again this is part of open science is opening up that black box opening up the process opening up the product let's see your algorithm yes I mean we realize that there are some the proprietary items in there but at least share your process with us so the healthcare industry that's just one example we are being monetized every time we go online all kinds of you know do you accept cookies if you don't accept the cookies you can't see this website that's another I would call that unethical it's not open science or open sharing responsible AI is a much more global approach or a much higher level approach talking about who's involved in developing these systems how are they being funded what systems are they running on ethical AI is much more in the weeds and it's all about the development of the system great question so here's a quick question how are you combating AI hallucination say in in these systems that often happens when you have large sources of code AI tells you one thing it's trained improperly how exactly do you want what's your take on that I'm sorry it was a little hard to can you repeat the question a little bit louder so with AI how is one combating AI hallucination when you're drawing from multiple sources or bias sources how are you making sure to mitigate those bias in those sources as you go and how to incorporate that later on to combat hallucination so as a librarian what we are really working with is the product the end result of these systems is the information has already been filtered through some kind of artificial intelligence or machine learning process and so what we do is we a lot of comparative research we show the person we're working with I'm going to call them the user 5, 7, 10 different results and we caution them and we say okay here are the sources 
that this is where we got the information from let's sit down and let's walk through the process together so you can see what search I did and how we got this information you as the subject matter expert because you're you've got the knowledge you've got the science we have the know how in terms of finding the information we can partner and I can show you how I got to this and if the results are a little bit different we can have a conversation about why do you think they're different and we can dig as deep as possible into these systems into the back end and again this goes into the ethical AI question how open are these systems if they're not open then it just remains a question and you as the searcher as the user all I can say is you know be your own best advocate buyer beware and you may want to gauge your information analysis based on a couple of sources not just one I hope that answers your question I have a question shifting gears a little bit away from AI there's a couple of different ideas around repositories so for instance github is a repository system where you're constantly refreshing and updating your code libraries tend to have published works where you're not gonna maybe you'll get a second edition of a book but you might have the first edition of a book lying around your repositories where does this stand so if I gave you a piece of open hardware and then I had an iteration on that piece of open hardware is that a new repository is that a second version like how are you handling that and the references that go around because like a reference may be only relevant to a previous version of the hardware really good question in fact we just had this question come up in a meeting a couple of days ago so for us at JPL and again I can only speak for JPL the way we handle that is the nature, quality and distinction of the change in the hardware so from version one let's say we upload it to the repository it's out there you get a DOI you get a digital object identifier which is a persistent identifier to that version of the hardware or the tool or the code or the whatever if you come to us once a month, a year later and say hey we've made some updates we have to have a conversation because if it's a substantive change something that fundamentally changes the way the process works or the outcome of you run your data through some software if you make tweaks to the software if it changes the result we're going to assign you another DOI that persistent identifier only works for the first version of the hardware if there's a substantive change that affects the outcome we're going to assign you a different persistent identifier so that both of them are out there you can put a disclaimer you can put a readme file saying no this please use this one don't use this one but in the spirit of open science that first version is also out there for people to look at right it sounds like are you familiar with semantic versioning um a little bit yeah sounds very similar anyway other questions I was curious in terms of the work around knowledge management um what if any efforts are there around telling the stories like verbally whether audio video behind all the work that's being done both technical and also from like the personal narratives of those individuals really good question so there are a number of not only efforts but schools of thought around this um one school of thought is in terms of what's called knowledge capture or knowledge transfer when you have an employee who's leaving an 
organization, you want to make sure that the years of contacts and information, everything in their head, if you will, doesn't walk out the door with them, anything that would benefit the organization. So you can sit down with them and do an interview and transcribe it; you can video them and do auto-captioning so that people can search the text of the captions. That's one way to capture their words. You can also have them write out answers to canned questions the organization has come up with, based on what it feels is important. Another really interesting school of thought, and something the JPL library has been doing for many, many years, is storytelling. Storytelling involves someone with, let's say, 40 years of experience at NASA, given a topic or no topic at all, who just sits there and starts talking, because quite often our colleagues who've been around for 30, 40, 45 years are wonderful storytellers. You get them over a glass of wine or a cup of coffee and they will talk your ear off. We want to capture that, and we want a broader audience for it. Within that there are two schools of thought: to record or not to record, because knowing there is a capture device in the room often changes how someone will tell a story; it's not as spontaneous, not as organic, as if you're just sitting across from them. So we've done it both ways. We've had the non-recorded version, which we call JPL Stories, where we invite someone who has retired, or is on the verge of retiring, or is just a really good storyteller, to sit with 15 or 20 people in the audience and spin yarns for an hour. We've also had a more intentional process where we give them a topic and they tell stories, again from their own perspective, but we do capture that on video or audio and we do transcription. So it really changes the dynamic between the storyteller and the audience whether there's a microphone on or not. Really good question. Any other questions? All right, well, thank you very much. Thanks, everybody. I appreciate it.

The next talk in this room will be going over chaos engineering for handling large datasets.

Hello, everyone. I'll give it just one more minute for folks to trickle in, and then we'll get started. All right, let's get started, and thank you so much for coming in on a Sunday afternoon, which is understandably not one of the finest time slots to speak in. Today we are going to talk about chaos engineering for data, meaning how we apply chaos engineering principles to the world of data and use them to manage our data and its life cycle stages in complex data platforms. I am Vino, and I work for a company called lakeFS, which is an open source data versioning engine for large-scale data lakes. Essentially, if you are familiar with Git, which all of us are, lakeFS is just Git for data: it gives you large-scale data versioning for your data, just like what Git does for your source code. I am an open source contributor and a developer advocate for the lakeFS project. As you can see, I started off as a software engineer, then moved into the ML space, and currently I work as a developer advocate for this OSS project, lakeFS. You can connect with me on the socials shown here.
I mostly talk and write about data engineering best practices, or how we can learn those best practices from the software engineering and DevOps side of the world.

Before getting into the data, I wanted to understand who I'm working with. By a show of hands, do you manage a data team, or are you a data engineer, or are you working with analytics data? Okay, I see some hands, so this is going to be a very interesting conversation.

One of the biggest challenges when we are working with data, and with distributed systems at the scale of data we process on them, is that most of these distributed systems are stochastic rather than deterministic. Spark under the hood is not always going to do exactly what you coded it for; the stochastic elements will mess you up, and you can end up with an indeterminate or inconsistent state of data when a Spark job fails partway through. And it's not just Spark jobs; there is a slew of other challenges that come up when you are working with large-scale data flows. As a data engineer who has worked on different data teams at Nike and Apple, these are literally some of the issues I have faced myself, and based on your experience and the industry you come from, I'm sure you have your own set of issues you've dealt with when working with big data systems.

For me, new component logic can be as simple as the business redefining a KPI, which means that as a data engineer or data infrastructure engineer you have to go back, re-run all the data pipelines, and backfill the data for the historic period you're looking at. Or new data sources: I don't even have to get into the details, because our data sources write to our data lakes through APIs as JSON files, and they very easily update their schema or add a new key and value to the JSON, which messes up almost all the data we have downstream, especially for the business data and analytics workflows at the end of the flow. Incompatible schema changes, of course: the most common challenges we deal with as data folks are schema evolution and schema enforcement, again because of how our data sources write data into our data lake. Spark jobs running in parallel: sometimes we re-run a job when it fails, and if the Spark job is not idempotent it ends up creating duplicate data. It's very simple: if you write a Spark job in append mode instead of overwrite, it will create duplicates, which you definitely don't want to deal with; it's as easy as that to create duplicate, inconsistent data (there's a small sketch of this below). Changing table relationship keys: because we work with data lakes that have tens or hundreds of tables, updating one might not update the others, which can lead to inconsistent table states, and the relationship keys get messed up. And data deletion, of course: the FAA mishap that happened a couple of weeks ago, which we all watched affect airline flights in real time, came down to someone accidentally deleting a data partition.
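To make that duplication point concrete, here is a small PySpark sketch; the paths, columns, and dates are made up, and the overwrite pattern shown is just one common way of keeping a daily load idempotent, not a claim about how any particular team does it.

```python
# Sketch of the append-vs-overwrite pitfall (paths, columns, and dates are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("idempotency-demo").getOrCreate()

# Hypothetical daily load: tag the rows with the partition date they belong to.
df = (spark.read.json("s3a://my-lake/raw/events/2023-03-11/")
           .withColumn("dt", F.lit("2023-03-11")))

# NOT idempotent: if the job fails halfway and is re-run, the same rows land twice.
df.write.mode("append").partitionBy("dt").parquet("s3a://my-lake/curated/events/")

# A more idempotent pattern: each run fully rewrites only the partition it owns,
# so a retry converges to the same result instead of accumulating duplicates.
(df.write.mode("overwrite")
   .option("partitionOverwriteMode", "dynamic")  # overwrite just the partitions being written
   .partitionBy("dt")
   .parquet("s3a://my-lake/curated/events/"))
```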
And data duplication, again, like I said, is one of the most common challenges we deal with. In the data world, testing data systems and data pipelines is very hard compared to the software engineering world. When you are developing an application, it is totally mainstream that you have test-driven development, unit tests, CI tests, and only push code after the CI tests pass; that best practice is followed through and through. In the data world, though, because not a lot of us come from a computer science or traditional engineering background, a slew of these best practices are not followed, which adds to the challenges of working with big data. And while we know that unit tests, integration tests, and end-to-end tests should be run, even I have been guilty of skipping them in some cases, where we did not run integration tests because at the end of the week there's a deadline saying, hey, I want this data for this team by the end of the week, and it doesn't matter how you put the data together or what kind of workaround you use to get it to them. So because of these not-yet-established standard best practices in the data world, testing data pipelines and data systems is extremely hard.

This is where we take inspiration from the DevOps world. On the DevOps side, when you are working with SREs, you think about chaos engineering principles and how to implement them so your systems are reliable, so that even when faced with adversity they can handle it and, in some cases, self-heal. Let's take those chaos engineering principles from the DevOps world and see how they can be adapted to the data engineering side of things.

The first principle, of course, is to define a steady state. When we are working with data, we want to define what the steady-state requirements would be. In the SRE world you can think of this as your systems' throughput or whatever other SLAs matter most to your team for keeping the systems reliable; on the data side, it's data product requirements: what are your data quality metrics, and what are the SLA metrics you are giving to your downstream analytics teams and business users? Defining all those product requirements is defining the ideal state, the steady state, of your systems. The interesting challenge here is that we are not used to treating data as a product, so most of us do not have these requirements defined, whether from the business side, from leadership, or even from the data engineers who curate the data needed downstream. There are no existing metrics that all data engineering teams follow as a standard around data quality, duplication, or downstream SLAs, for example. And like I said before, it's not that hard to duplicate your data and mess with what's currently in production: just change the write mode from overwrite to append and you now have two copies of the same data sitting in the same partition, thanks to Spark; that's how easy it is to corrupt the data sitting in production. So we understand that the first thing we need to do is identify and define the steady state (the sketch below gives a flavor of what those checks can look like).
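As a sketch of what defining a steady state can look like in practice, here are a few hypothetical checks a team might agree on as part of its data product requirements; the table, columns, and thresholds are invented for illustration.

```python
# Hypothetical steady-state checks for a curated table (names and thresholds are invented).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("steady-state-checks").getOrCreate()
df = spark.read.parquet("s3a://my-lake/curated/events/dt=2023-03-11/")

checks = {
    # Volume within an agreed range (catches missing loads and double loads).
    "row_count_in_range": 1_000_000 <= df.count() <= 5_000_000,
    # Primary key is unique (catches the append-mode duplication problem above).
    "no_duplicate_keys": df.count() == df.select("event_id").distinct().count(),
    # Critical fields are populated.
    "no_null_user_ids": df.filter(F.col("user_id").isNull()).count() == 0,
    # Schema contract: the agreed-upon columns are present.
    "schema_contract": {"event_id", "user_id", "event_ts"}.issubset(set(df.columns)),
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    raise ValueError(f"Steady state violated: {failed}")  # block promotion downstream
```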
To define that steady state, we need to treat the data as a product and come up with product requirements, so that teams can adhere to those requirements. Data contracts, for example, have been talked about so much recently, and they are very close to this idea: data contracts describe the product requirements, the data agreements, from your data sources all the way to your data consumers. Defining these data product requirements is the first step.

The second chaos engineering principle is to simulate varying real-world events, the way the DevOps world does with its systems. Netflix came up with Chaos Monkey, which essentially simulates the real-life production issues that happen in production systems; they artificially inject those failures to see how the production systems and servers face them, handle them, or even heal from them automatically in extreme situations. Put things there to break, and see what happens. Similarly, on the data engineering side we want to simulate real-world scenarios, which again are schema changes, corrupt data, and data variance, which is one of the most important challenges especially when you're dealing with MLOps or ML use cases: if the data variance gets very high over a period of time, the ML model that worked before may not make sense anymore, and that's a significant challenge in the ML world. And accidentally deleting data, of course. So how do we go about simulating these real-world events? Because simulating them is exactly how you test your data pipelines; they're nothing but a way of testing your pipeline, as the sketch below illustrates. And this is the fun part, because when you're on call, even as the data person, you always have fun weekends with the family, not at all worried about that one pipeline that broke on a Friday evening that nobody else is worried about.
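Here is a hedged sketch of what simulating those real-world events might look like for data: small functions that deliberately inject a schema change, corrupted values, or a dropped partition into an isolated copy of the data, so the steady-state checks can be exercised against them. Everything here, names, paths, and thresholds included, is illustrative.

```python
# Illustrative "chaos for data" experiments: inject failures into an isolated copy of the data,
# then confirm the steady-state checks catch them. All names and paths are invented.
from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("data-chaos").getOrCreate()
df = spark.read.parquet("s3a://my-lake/experiment-1/curated/events/")  # an isolated branch/copy

def inject_schema_change(d: DataFrame) -> DataFrame:
    # Upstream renames a field and adds a new one, as JSON sources tend to do.
    return d.withColumnRenamed("user_id", "userId").withColumn("new_flag", F.lit(True))

def inject_corrupt_values(d: DataFrame, fraction: float = 0.01) -> DataFrame:
    # Null out a small random fraction of a critical column.
    return d.withColumn(
        "user_id",
        F.when(F.rand(seed=42) < fraction, F.lit(None)).otherwise(F.col("user_id")),
    )

def inject_dropped_partition(d: DataFrame) -> DataFrame:
    # Pretend someone accidentally deleted a day of data.
    return d.filter(F.col("dt") != "2023-03-10")

def steady_state_holds(d: DataFrame) -> bool:
    # A minimal stand-in for the checks sketched earlier.
    return ("user_id" in d.columns
            and d.filter(F.col("user_id").isNull()).count() == 0
            and d.select("dt").distinct().count() >= 7)

for name, experiment in [("schema change", inject_schema_change),
                         ("corrupt values", inject_corrupt_values),
                         ("dropped partition", inject_dropped_partition)]:
    ok = steady_state_holds(experiment(df))
    # We *want* these to be caught; a silent pass means the checks have a blind spot.
    print(f"{name}: {'caught by checks' if not ok else 'NOT caught - tighten the checks'}")
```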
The third principle: so far we have defined the product requirements for our data and we want to simulate certain real-world scenarios; the third point is that we want to run those simulated real-world scenarios on the production data. Why not anywhere else? Even if you try to simulate these things on mock data, for example: most of the time, when data teams test data pipelines, we end up creating mock data, because that's the easiest to create, test, and validate. But mock data does not truly reflect the variety, volume, and velocity of the real production data. The other thing some of us do is copy part of the production data out into testing or staging environments and test our production pipelines on that. Even if you only copy part of production outside, as I was discussing with one of you before, there are data privacy regulations that keep changing, thanks to GDPR, CCPA, and whatever new regulation came up last week, and the biggest challenge is that when you copy your production data outside to test your data pipelines, you now have to make sure your data governance policies are set up so that you are not violating any of those regulations. It's not one of the greatest ways to go about it, which is exactly why you need to break things in production and see how your production data is capable of handling it.

Now, it's easy for us to say you should run these experiments in production, break things, and see how your production systems handle it, but is that even the right way to go about it? Are we all ready to mess with our production data? Of course not. So how else do we do this? This is exactly where lakeFS comes in: you work with the production data, but in an isolated environment. You branch out of the production data. In this picture, for example, I have main, where all my production data lives; I branch out of main and create experiment-1, experiment-2, however many branches I want, and in those branches I can run all these experiments to make sure the production data is robust and reliable and adheres to the steady-state requirements we set up in the earlier steps. And here is an example of how Git for data works: assuming your data is sitting in an S3 bucket, the path to the data looks like what's on top, and by introducing lakeFS to do the versioning, to enable Git for data, all you do is add another prefix, which is the name of the branch, and you can access the production data in that branch as is. With very minimal intervention in the existing code base, you can run these experiments on your production data without actually affecting the production data, because the data in main remains the same while you're experimenting in the different branches.

Now, it's one thing to run these experiments manually by branching out and experimenting in isolated branches that don't affect your original production data. The other thing, again learning from Chaos Monkey, is that in the systems and DevOps world they automate these experiments, so you keep running and breaking things to see how reliable and robust your production systems are. How can you do these experiments in an automated fashion? Think of it this way: you have multiple branches, like main and experiment-1 and experiment-2 that we saw before, and say you have a suite of tests you've written, unit tests or basic integration tests you want to run. For example, as a data engineer, you have yesterday's data partition sitting in a specific S3 bucket; you take the data, process it, do everything in an isolated branch, and then put the data through the full set of tests you have. Only if those tests pass do you deploy that new data into the production system, put it into main. You're managing the life cycle stages of the data: you have the ingested data, and only if it passes certain tests do you push it into production. You're promoting data from one stage to another, and you're doing it automatically in a way that ensures your data quality is in place; I'll show you later in the demo how we can do that by automatically promoting data from one stage to another (the sketch below gives the general shape).
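Here is a hedged sketch of that branch-out, test, and promote flow, along the lines just described. The repository name, paths, and checks are invented, the Spark session is assumed to be configured to talk to lakeFS's S3 gateway, and the lakectl commands in the comments are approximate; check the lakeFS documentation for the exact CLI or SDK calls for your version.

```python
# Sketch of branch-out / test / promote ("CI/CD for data"). Names and paths are illustrative,
# and Spark is assumed to be pointed at the lakeFS S3 gateway (fs.s3a.endpoint, credentials).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("promote-via-branch").getOrCreate()
REPO, BRANCH = "data-lake", "etl-2023-03-11"

# 1. Branch out of production; a metadata-only operation, so it is near-instant. For example:
#      lakectl branch create lakefs://data-lake/etl-2023-03-11 --source lakefs://data-lake/main

# 2. Run the pipeline against the isolated branch: same data, just a branch name in the path.
raw = spark.read.json(f"s3a://{REPO}/{BRANCH}/raw/events/dt=2023-03-11/")
curated = raw.dropDuplicates(["event_id"])
curated.write.mode("overwrite").parquet(f"s3a://{REPO}/{BRANCH}/curated/events/dt=2023-03-11/")

# 3. Run the steady-state checks (see the earlier sketch) against the branch's output.
check = spark.read.parquet(f"s3a://{REPO}/{BRANCH}/curated/events/dt=2023-03-11/")
assert check.count() > 0, "empty load"
assert check.count() == check.dropDuplicates(["event_id"]).count(), "duplicate keys"

# 4. Only if the checks pass, commit and merge the branch back into main, for example:
#      lakectl commit lakefs://data-lake/etl-2023-03-11 -m "daily load 2023-03-11"
#      lakectl merge  lakefs://data-lake/etl-2023-03-11 lakefs://data-lake/main
#    A pre-merge hook on main can run the same checks server-side and block the merge on failure.
```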
Another thing, like I was just saying about the different stages: in development you collaborate with the team members you work with and experiment to see how robust your systems are. The deployment side is where you actually need version control, because once you deploy, if things are not what they seem, you always want the revert option, just like a Git revert, to get out of an inconsistent state of the data. The deployment phase is also where you make sure data quality is in place, and only if the data quality standards and SLAs are met do you push that data into production. But we also need a way to automatically promote data from development to deployment to production, without manual intervention, while all these experiments are running. That is where CI/CD, but for data, comes in. What does it look like? Just like the source control and CI/CD workflow we already know for code. Say all your data lives in main: you create a new branch, run your experiments, run the suite of tests you care about, add data, delete data, do whatever tinkering with the real production data you need to check how robust your pipelines are. At the end of it you can have hooks, just like Git has hooks, and only if a certain set of criteria is met does the data actually get pushed into production. lakeFS lets you implement this workflow with something called lakeFS hooks; there are pre-merge, post-merge, pre-commit, and post-commit hooks, and a bunch of other operations as well. For example, if I do not want any CSV or JSON files in production, only Parquet, I can define that logic in a custom hook and make sure every file pushed into production is Parquet.

Now, I have talked a lot about lakeFS, how it does Git for data and how it helps you implement CI/CD for data, so I want to highlight a little bit about how it works internally as well. As you can see here, lakeFS essentially creates a set of pointers to the data stored underneath. When you create a new branch, it just copies those pointers into the new branch, so the data sitting underneath is still the same, but the new branch gives you access to all the underlying objects through its own pointers, and you can work with them in isolation. Currently lakeFS works with all three major cloud providers, so the store can be AWS S3, GCS, or Azure Blob, and it also works with MinIO or any other S3-compatible object store. The minute you start adding new data, it does a copy-on-write: it writes the actual physical objects back to storage and creates pointers for those as well. Because creating a branch is a metadata-only operation that copies only pointers, you can create a whole branch in a few milliseconds without putting the actual production data at risk. What I mean by that is, we have users, Lockheed Martin and Volvo for example, who have petabyte-sized data lakes.
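In lakeFS the hook wiring itself is done through action definitions that can call out to things like webhooks; what follows is only my own minimal sketch, in plain Python, of the kind of validation logic such a pre-merge check could run. The "Parquet only" rule and the helper names are illustrative, not lakeFS API calls.

```python
ALLOWED_SUFFIXES = (".parquet",)   # the "Parquet only in production" rule


def check_formats(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that would violate the production format rule."""
    return [
        p for p in changed_paths
        if not p.endswith(ALLOWED_SUFFIXES) and not p.endswith("/")
    ]


def pre_merge_gate(changed_paths: list[str]) -> None:
    """Fail (raise) if any disallowed file would land in main after the merge."""
    violations = check_formats(changed_paths)
    if violations:
        raise ValueError(f"blocked by pre-merge check: {violations}")


# Hypothetical diff coming from a staging branch:
try:
    pre_merge_gate([
        "analytics/movies/country=US/part-0000.parquet",
        "analytics/movies/_tmp/movies.csv",   # this one blocks the merge
    ])
except ValueError as exc:
    print(exc)
```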
When they want to branch out, imagine a few petabytes of data sitting in an S3 bucket: to mimic real production data you would have to copy it into another bucket to run your experiment. That takes hours, and it is not even advisable, because it is petabytes of data, and however cheap S3 may be, a few petabytes sitting around in different environments eventually adds up. By creating a new branch instead, a metadata-only operation, you have those same petabytes available to you for all the experimenting and testing you want to run.

So far I have only talked about a couple of use cases, like lakeFS hooks, similar to Git hooks, running all your quality checks before data is pushed into production, but there are a few other ways we have seen people use lakeFS. The first is quickly recovering from an error. Remember that image I showed you: it is a Friday evening, you are on call, something spooky is going on with the production data, and the business user wants you to look at it. All you need to do is literally revert to the previous commit, go have your fun weekend, and come back on Monday morning to look at what went wrong. The first step in any troubleshooting is making sure your downstream consumers have access to a consistent state of high-quality data, and by reverting, then branching out of production to troubleshoot, you can do exactly that. Then there is development in isolation: you create a new branch and do all your development work there before you push anything to production. Troubleshooting I just mentioned: in most cases, if there is a bug in production, you do not want to work on the live production data, although we have all been guilty of it, including me. The better way is to branch out of the production data that has the bug; you can debug and troubleshoot all you want, and if you have a hotfix to test, you run it on that branch and only merge it back into main when everything is fine. Atomic updates matter because, with data lakes, none of us works with one or two tables; we are talking tens or even hundreds of tables, depending on the size of your data. These tables are not updated concurrently, because they come from different data sources at different frequencies, and when you expose this upstream data to your downstream users you want it to be consistent: you do not want one table updated while another is stuck in yesterday's state, out of sync with the rest. Using the merge feature, all these tables can be updated at their own frequency on a branch, and when you merge, they all land in production at the same time, or not at all, so there are no inconsistencies across tables, thanks to this Git-for-data feature.
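As a sketch of the "revert and enjoy your weekend" scenario: this assumes the OpenAPI-generated lakefs_client Python package; the repository name and credentials are made up, and the exact method and model names can differ between client versions (older versions expose `client.refs` / `client.branches` instead of the `_api` names), so treat it as an outline rather than copy-paste code.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Hypothetical local lakeFS installation and credentials.
configuration = lakefs_client.Configuration(host="http://localhost:8000")
configuration.username = "AKIA..."
configuration.password = "..."
client = LakeFSClient(configuration)

REPO, BRANCH = "netflix-movies", "main"

# Look at the recent commit history on main to see what just landed.
log = client.refs_api.log_commits(repository=REPO, ref=BRANCH, amount=5)
for commit in log.results:
    print(commit.id, commit.message)

# Roll main back by reverting its most recent commit; downstream consumers
# immediately see the last known-good state, and the bad data can be
# debugged later on a separate branch.
client.branches_api.revert_branch(
    repository=REPO,
    branch=BRANCH,
    revert_creation=models.RevertCreation(ref=BRANCH, parent_number=0),
)
```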
Reprocessing is another use case we touched on: suppose your business users have changed the metric definitions, or come up with new metrics they now want to look at. The way we have always done it, you reprocess the data and backfill however many years of history you need. You can do all that reprocessing in a new branch, and at the end, once the validations are done, merge the data into main. You can even have two parallel pipelines running, one with the old metric and one with the new metric, and compare how they behave before you actually merge.

Now, conceptually this all makes sense; everybody understands what Git for data is and why we need it. But how do you implement it in your own work? What does it look like? One of the most common patterns we have seen OSS users try with lakeFS is an effectively dirty ingest branch. Yesterday's partition lands in your S3 bucket, and you want to process it before it goes to prod and make sure the quality is up to the mark, so you do the dirty ingest work in a separate branch, run the quality checks, and only then push it into production. As you can see here, you can also define branch protection rules, so that only certain team members can work directly on the main branch and nobody without the right RBAC can touch the production data directly. Another pattern is on the troubleshooting or experimentation side, where you literally branch out of the existing production data into a new branch and work there. If you are an ML engineer, you can take the same training data and try out different ML models to see which gives you the metric you are looking for, or try different preprocessing methods, and so on: create a branch per method, run them side by side, compare at the end, and either keep the winner or delete them all, because if it was only a temporary sandbox for testing, there is nothing to merge back; you made a branch, you delete the branch, and you are done.

Now I have a quick demo of how you can use lakeFS to enable CI/CD for data. I have tried to simulate a data engineer's experience: incoming data lands in a specific S3 bucket, I read it into a staging branch, do all sorts of cleaning and preprocessing, and once I am done, if the quality is up to the mark, I merge it into the main branch. That is the simple workflow in the demo. This is what the lakeFS UI looks like: it has repositories and certain admin functions, and I am running all of this locally, so instead of S3 I am going to use MinIO, because MinIO exposes S3-compatible APIs that I can use.
In here I am going to use the Python lakeFS client, because I am a Python user and I would rather use Python to show you how lakeFS can work for you. The first step is to install the Python client, do a couple of imports and configs, and I have a simple Netflix movies dataset to work with. First I go to MinIO, which, if you are familiar with S3, works very much like an S3 bucket: I create a new bucket, call it netflix-movies-data, keep the default settings, and create it. Now the next step is, of course, to create a repository. A repository is the container that holds all of your data, and you then create your different branches on that repository. So let's create it: I will call it netflix-movies, create it on the netflix-movies-data bucket we just made in MinIO, and keep main as the default branch. Right now the repository has no objects, because we have not uploaded anything yet.

The idea is to have two branches besides main: one is the ingest landing area, where the incoming data will land and just sit, and the other is the staging branch, where we will do all the preprocessing and cleaning of the data. When I list the branches right now there is only main, so let's create the ingest branch from main, and another branch, the staging branch, also from main. Now we have three branches: ingest landing area, main, and staging, and if we pop back to the lakeFS UI, you can see all three under Branches. One interesting thing to note is that the commit IDs, the commit hashes, are all identical, because of course they are pointing to the same objects underneath; we have no objects yet, but they all share the same pointers.

Now let's simulate incoming data, a data source pushing something into your landing zone. I have movies.csv, just a small CSV file on my machine, and I upload it into the ingest path, that is, the ingest branch. Once it is uploaded, a diff shows that an object of a particular size was added at that path. If we check the ingest landing area, the input file is there, a very small CSV just for the sake of the demo. Under Uncommitted Changes you can see which files were added and how many files changed. So we have uncommitted data; let's commit it. Just like a Git commit, there is a commit message, and you can also pass commit metadata as one of the arguments to this function. In the metadata you can record who is updating the data and when, so you have a lineage of who changed what and at what time, with whatever details you want, available later for your team as well.
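For anyone following along at home, here is roughly what those demo steps look like with the lakefs_client Python package. The repository, bucket, and branch names mirror the demo, but the credentials and file paths are placeholders, and exact method names may vary by client version (older versions expose `client.repositories`, `client.branches`, and so on without the `_api` suffix).

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="http://localhost:8000")
configuration.username = "AKIA..."     # placeholder lakeFS access key
configuration.password = "..."         # placeholder lakeFS secret key
client = LakeFSClient(configuration)

REPO = "netflix-movies"

# 1. Create the repository on top of the MinIO bucket created in the UI.
client.repositories_api.create_repository(
    models.RepositoryCreation(
        name=REPO,
        storage_namespace="s3://netflix-movies-data",
        default_branch="main",
    )
)

# 2. Create the ingest landing area and staging branches from main.
for branch in ("ingest-landing-area", "staging"):
    client.branches_api.create_branch(
        repository=REPO,
        branch_creation=models.BranchCreation(name=branch, source="main"),
    )

# 3. Simulate the data source dropping a file into the landing zone.
with open("movies.csv", "rb") as f:
    client.objects_api.upload_object(
        repository=REPO,
        branch="ingest-landing-area",
        path="raw/movies.csv",
        content=f,
    )

# 4. Commit, with metadata recording who changed the data and why,
#    so the commit log doubles as a simple lineage record.
client.commits_api.commit(
    repository=REPO,
    branch="ingest-landing-area",
    commit_creation=models.CommitCreation(
        message="Ingest raw Netflix movies CSV",
        metadata={"author": "demo-user", "source": "landing-zone"},
    ),
)
```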
Next I copy all of the data from the ingest branch to the staging area, because staging is where I am going to do the actual cleaning and preprocessing. So let's copy it over and go to the staging branch: we now have the CSV file in staging. Let's commit that as well, since staging has no commits yet, and once it is committed we can see it in the commit history.

Staging is where all the action happens. This is where you clean up the data, partition it by a different column, or do whatever else you need. Here I am working with just one table; most likely you would be working with multiple tables, doing all the joins and everything needed to curate the data for your downstream users. As a quick look at the data before we start: it is just a few thousand rows and a couple of columns, all about movies. Another interesting feature in lakeFS is the DuckDB integration: without having to write Spark jobs, you can go to the lakeFS UI, open the objects, and run your queries right there. You can do data exploration, even data cleaning, and compare data across branches to check that the cleaning was done the right way, all from the UI using SQL with DuckDB. Data folks are almost evenly divided, right, we have Spark people and we have SQL people, and if you are comfortable with SQL and do not want to deal with Spark, or with reading the data out of S3 yourself, this is for you. I, however, am going to continue with Spark here.

First come the null checks, because how could a typical data engineering presentation go without the clichéd null checks, and then deleting the null records. There were a couple of nulls, and now those records are gone. Next I write the result out as Parquet, partitioned by country, and see if that helps our downstream users. Under the objects you can now see the analytics output partitioned by country; Spark is still running, and there are the different country partitions. Great, it finished successfully. Remember, one of my requirements was no CSV files, I want Parquet, so I wrote Parquet; I wanted the data partitioned by country, so I partitioned it by country; and I cleaned up the null values. Those were my requirements, I am happy with how the data turned out, so I can simply merge this into main: incoming data came in, we did some preprocessing, and we are done. Let's merge it into our main branch. We had not added anything to main before; we added data only to the ingest landing area and then copied it to staging, and now, after the merge, main has everything from the staging area. If you look at the commit history, you can see the whole list: the repository was created, the data was copied to the staging area, and the Parquet files were loaded.
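And here is a hedged sketch of the Spark side of that workflow: reading the CSV from the staging branch through the lakeFS S3 gateway, dropping the null records, writing Parquet partitioned by country, then committing and merging staging into main with the Python client. The endpoint, credentials, and column names are assumptions based on the demo, not exact code from it, and it assumes the hadoop-aws package is on the Spark classpath.

```python
from pyspark.sql import SparkSession

import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Point s3a at the lakeFS S3 gateway so branch names become path prefixes.
spark = (
    SparkSession.builder.appName("staging-cleanup")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8000")  # lakeFS gateway
    .config("spark.hadoop.fs.s3a.access.key", "AKIA...")
    .config("spark.hadoop.fs.s3a.secret.key", "...")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the raw CSV from the staging branch, drop records with nulls,
# and write the result back to staging as Parquet partitioned by country.
raw = spark.read.option("header", "true").csv("s3a://netflix-movies/staging/raw/movies.csv")
clean = raw.dropna()
clean.write.mode("overwrite").partitionBy("country").parquet(
    "s3a://netflix-movies/staging/analytics/movies/"
)

# Commit the cleaned output on staging, then merge staging into main so the
# downstream users see the new data all at once.
configuration = lakefs_client.Configuration(host="http://localhost:8000")
configuration.username = "AKIA..."
configuration.password = "..."
client = LakeFSClient(configuration)

client.commits_api.commit(
    repository="netflix-movies",
    branch="staging",
    commit_creation=models.CommitCreation(message="Clean nulls, write Parquet by country"),
)
client.refs_api.merge_into_branch(
    repository="netflix-movies", source_ref="staging", destination_branch="main"
)
```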
And eventually we merged it all. If you click into the commit history, you can see the parent commit of each commit and walk all the way back to the commit that created each of these steps. So the data lineage, if you want it, can also be accessed from the Python API: you can use it as a commit log and even write it into your application logs. Here, for example, you can check the change summary and the number of files that changed in each commit. At the end I do not need my staging branch anymore, I am done with it, so I simply delete it, because it was a temporary environment I created for preprocessing the data and doing some basic cleaning and checks. Once you are done, you are good to go. This was a very simple example of a workflow for using lakeFS to enable this kind of automated experimentation on production data.

A quick recap of what I wanted to get across: by adapting chaos engineering principles from the DevOps world to the data world, we can ensure the data systems are robust and the data is reliable and of high quality. We also now know how to promote data from one stage to another in an automated fashion while sticking to the data product requirements we defined up front. Git for data lets us do all of these things, and to implement Git for data, check out lakeFS. Like I said, lakeFS is an open source project with an amazing community of users and contributors; these are just some of the companies currently using lakeFS for their use cases. We have a thriving Slack community where people discuss the different challenges in their data engineering workflows and the different data architectures they have been trying out for their orgs and teams. And because it is open source, if you think it could be of value to your team, or even for a personal project, feel free to check it out; you should be able to do a POC or a quick experiment with whatever you want. That is all I had for today, thank you so much. All right, any questions?

I really appreciated the talk. I am more in the HPC world, but I can already see a few use cases for this. I am curious: cutting-edge HPC is at petabyte scale and heading toward pre-exascale, pre-exabyte levels. Are there any issues with creating a data repository at petabyte or exabyte scale, any challenges around that right now with lakeFS?

Currently, like I mentioned, a couple of customers have been using it at a few hundred petabytes, and we have not had a lot of issues in terms of scalability, which is what you are asking about. We have also run benchmarking tests and operations to see how robust the system is at scale, and I believe those were also done at petabyte scale.

As a follow-up, before I ask my real question: does it have to read your files, or is it just looking at your S3 index and the checksums it already has?
Is it like Git in that it checksums your files, but without actually reading them? So, it is reading the files: it does not use any metadata that S3 already has, or even the file formats' own metadata, for example Parquet's internal metadata. The way lakeFS currently works, it creates the object hashes itself, so it is not externally dependent. So if you have a petabyte and you turn it on, it is going to take a while before you can start doing things? Yes, when you import data at petabyte scale it takes time, because importing the data essentially means creating all of those pointers. But what I have noticed about the read and write patterns, including with the customers I have spoken to, is that you essentially write once and then most of the operations are reads. So the long scan through the actual physical objects is not a problem, because it is a one-time operation, and reads are what happen over and over in these distributed big data systems.

My real question was: it seems like there are certain types of data you are talking about, and you did not really outline what this is useful for and what it is not useful for; there is real-time data processing, for example, and I am not sure whether this applies there. No, I get your question. With real-time data processing it does not help. Like I mentioned before, and I want to underline it again: lakeFS works with data lakes, meaning object stores. If the data is already at rest in one of these object stores, it can help you version that data. Data in motion, while it is being processed in real time, cannot be versioned; you cannot do Git for data on that.

How would you compare this to Pachyderm, DVC, Git LFS, or any of the other Git-for-large-datasets products? Pachyderm and DVC are primarily focused on ML use cases. Think about DVC, for example: it literally downloads all of the data onto the machine you are working on, and if you have petabyte-scale data, pulling it all down to a local machine is not a great fit for data engineering. With ML, though, even if you are training a big NLP model, you still end up with at most a few terabytes of data, because a few terabytes of quality data is enough to build an accurate model, so the scale of the data is not that big a problem in ML. That is where Pachyderm and DVC work really well, but it is not a great fit for data engineering workloads, where you simply cannot download your entire data lake onto the machine you are working with. That is the differentiator.

Any other questions? Thank you very much. Thank you so much.