Five-minute counter check in there for your mic, Autumn. Hello. Hello. Can you hear me now? Okay. You should have seen how many times we had to re-record our podcast, because I'm always... and all they can hear is my laughs. It'll quiet down a little bit once I close the doors, but if you guys want to move forward a little bit, that might help as well if you're having trouble hearing. Yeah, I have no in-between — it's just too loud or too quiet. Sorry.

These stickers are freaking amazing, I'm so excited. Justin, we need stickers next. Dwayne has stickers — like weird ones, and a wizard hat. I've been doing life wrong; he's got a wizard hat. Clearly where I messed up. I couldn't get on like twice yesterday, and I was like, I should just finish my slides before... the day before.

Do we have any SCALE first-timers? Oh, there's a bunch of you. They have fancy ribbons over by the information desk for first-timers, and I bet if you wear them, people will be really nice and give you extra swag, and won't even scan your badges and send you spam emails. Although probably not. How many people are SoCal locals? A bunch. All right, I'm jealous of all of your access to tacos.

Where's the rest of our posse? I love how you guys were so much more fun — like, I'm gonna go to bed and finish my slides. All right, who thinks that they came from the farthest away? Where'd you come from? All right, we have a DC. I think France wins. Yeah.

It's my first — be proud of me, it could have been my second. Justin doesn't drink coffee; judge him. Dwayne, judge him — you have kids and you work in tech. You should have seen the adventure we went on looking for Dr Pepper yesterday. Come with me everywhere. You're a wizard, you can just show up. This last minute is taking forever. Telephone — cold or hot? Cold or hot, or just nervous? It's 11:15.
Let's close the doors and get started. It's Saturday at SCALE 21x. This is Autumn Nash: Empowering Women in Data, Shaping an Inclusive AI Future. So please give a warm welcome, and we'll let Autumn take it away.

Thank you all for being here. I'm really excited to talk to you about data, because I'm a total data nerd. First I want to tell a little bit of my story, because I think when you tell the story, it gives you context for — not maybe solutions, but ways that I think other people can help. First I want to tell you why this is a problem, right? And then I want to follow up with what we can all do to help and make it better. So this is kind of my context for the making-it-better part.

I was a military spouse for 14 years, and when you're a military spouse, you're moving around a lot. You constantly don't have family, you don't have childcare. I had my own business and an art degree — which, as you all know, when you do art, you're poor. So I was like, I've got to go back to school, I've got to do something different. And I've been obsessed with computers since I was a little girl. I got one of those first iMacs that came in different colors back in the day — I got a teal one, and I was obsessed. I was like, look at all the gears and circuits and things in the background. I just thought it was so cool. And I've loved video games and coding, and I did everybody's MySpaces. So I was like, I'll go back to school for that.

One day I was complaining on Facebook about homework and why my Java code wouldn't compile, and people were like, you should go check out this group called MilSpouse Coders. I went for homework help and came back with best friends. Then I got the bug of teaching, so we do Coffees and Codes, we teach, and we did hackathons. We started our hackathon — I think 2017 was our first one — and I got to teach military spouses how to code, and I just really enjoyed it and loved it.

So fast forward a year: we were doing a PCS, which is a military move, and we were moving across the country. It was right after our first hackathon, and Microsoft stepped up to fully fund our hackathon. I was looking for an internship, but I couldn't get an internship because I was moving across the country. I was pregnant with my second son, and I was like, just give me six weeks and we can do this hackathon. So that baby in that picture is my son, and this is our first hackathon. I had moved across the country, I was in school, and I was trying to do all the right things, but I couldn't even get an internship — and you can't get a job without an internship, right?

So fast forward: somebody contacted me about apprenticeships, and they were like, there's an apprenticeship for this solutions architect job. And I was like, what is a solutions architect? I didn't even know what that was. I was like, I want to be an engineer. And they were like, just try it, it's a paid apprenticeship. And I was like, at least I could pay for childcare, you know? So I went, and that started my obsession with data. I became a NoSQL specialist solutions architect at AWS, and it opened up the world of data and what you can do with it.

MilSpouse Coders has now gotten so many military spouses and women jobs that we are the first nonprofit that has been inducted into the Military Spouse Employment Program run by the Department of Defense. Which shows that if the Department of Defense is funding something, that's how bad of a problem it is. Military spouse unemployment is 27 percent, versus the 4 to 7 percent that everyone else faces, and they're 51 percent underemployed — which means you probably have a master's degree and you're working at Target, and it sucks.

My favorite project is my three little boys. I'm a mentor for Rewriting the Code.
I judge Black tech hackathons, do workshops, and coach, and I'm also the Ship It! co-host — you might recognize that dude. We talk about everything after git push.

So in this talk, we're gonna ask: is gender a problem in tech? I think we've all seen all the programs about getting women into tech, getting people into tech, but it seems like we're not really moving the numbers. And the thing that is most concerning is that it's not just getting into tech — our retention rates are horrible. Why is it important for data to be correct? What is the relationship between data and AI, and how can women and diversity help that? How could this go wrong — we'll just pretend like we haven't read any of the news articles in the last month. And how can you help? Because I think it's one thing to talk about a problem, but if we don't have a solution, if we don't work together to make it better, the problem will just continue to be exacerbated.

Okay, let's start off with something fun. I want you guys to guess when this book was published. Just throw out a ballpark. No. You're really close. Would you say? No — so, 1999. Which is funny. Did you know there were more women in tech in 1989 than in 2009? That's depressing. We have put millions of dollars into these programs, and then the recession hit and they were like, just kidding.

So we've all heard all of the stories: it's the pipeline, we don't have enough. Did you know that top universities turn out women, Black, and Hispanic computer science majors at twice the rate that technology companies hire them? So there isn't a pipeline problem. The graduates are there; we're just not hiring them.

This slide is stuff that I'm sure you've all heard something about at some point. Women make up 25 percent of tech — that's one in four of technical roles. I think the total across all of tech is like 27 percent, but technical roles are one in four. By the age of 35, you're at one in eight, because by mid-career 56 percent of women leave. So you're already starting off with a really small percentage, and then you keep cutting it over and over again. And women were affected by 45 percent of layoffs. That doesn't sound like a lot, but we were already starting with such a small number to begin with.

Representation matters. If you can't see yourself somewhere, how are you supposed to dream of being there? Black, Latina, and indigenous women are only 4 percent of technology, and in the UK it's 0.77 percent. Also, fun fact: I looked for the statistics on Black women software engineers, and it's not even a statistic. The only statistic I could find was 1.6 percent for all Black women in STEM — scientists and engineers of every type, chemical engineers included. So there's not even a statistic for Black women software engineers. And I've never seen another Black woman software engineer in real life, ever. I think I've been in a meeting once where I was getting yelled at by other Black software engineer managers — well, I wasn't getting yelled at, our team was; let's say we were having a nice discussion. But the whole time I was like, oh my god, there are other brown people. I couldn't even focus on getting yelled at, because I had never seen them in real life before. That's not good.

So think about this. Have you ever sat there and seen something in an app or in software and said: who did this, and did nobody tell them it was a bad idea? We've all had that moment, right? You're just like, no one in the room said, maybe we shouldn't do this. Now everyone's arguing that we're in a recession, right? This is a time that we need to make profitability our main focus — which is how they justify layoffs, right?
But we're ignoring the fact that we have proven data — from Harvard Business Review, Forbes, and The Economist — that women in your company, and women leading companies, give you a higher return on investment. Women-led startups have 12 percent higher revenue. So if you're taking your bottom line into consideration, it is more profitable to have women on your team, to have women in your company. Also, Harvard Business Review has said over and over again that diverse teams are smarter and they innovate faster. You are going to build better software with diverse perspectives. You're gonna have someone in the room who says: hey, maybe this is a bad idea, maybe we should do more testing, maybe we should work on this.

Also, in the mindset of profitability: women influence 85 percent of discretionary spending. So think about it — you're basically telling your audience, the people that you want money from, that we don't really care about your opinion. That's not good business practice for anybody. Have you ever seen how most real estate agents will purposely cater more to the woman when a couple is buying a house? Because they know women buy houses. In technology, we're just going to ignore a huge part of our audience, the people that we want to buy our products.

Tech has faced layoffs where we've laid off a hundred and fifty-three thousand people. It's disproportionately affected women, people of color, and people on H-1B visas. But not just that — who are the first teams to be cut? Diversity, equity, and inclusion teams, apprenticeships, returnships, and all the different programs that we had to have to get people of different backgrounds to come in. Which doesn't make sense, especially when we've already talked about how that relates to your bottom line and making better software. This also signals to women and people of color that companies only cared about it when it was convenient. You're signaling to them that this was not a real issue that you actually believed in.

How does this relate to open source? Because this is an open source conference. Seventy percent of the software that builds our infrastructure is open source, yet only 9.8 percent of it is contributed by women. That means we have a huge disparity between women's and men's contributions in open source. I think open source is one of the most beautiful forms of software development, because it's a way that everybody gets to contribute to one common goal, no matter where you are, all over the world. And I think it will be one of the most important parts of AI, because it means everybody has a chance to contribute to this new form of technology.

This is an article from 2017, and we were already talking about how data was the most valuable commodity. Has anybody seen the talk about Reddit IPOing? Reddit has been technically non-profitable for years, right? But what is making them profitable right now?
Why do they have a five billion dollar valuation? Because of their data, right? AI needs data, because it can't learn and make decisions without that data. So the data is invaluable — you need good data to make good AI.

How could this go wrong? Has anybody seen any crazy news articles lately — I don't know, Gemini going wrong, all the different chatbots?

Hiring and recruiting. There was a makeup artist in the UK. She already had the job, and she got furloughed, and they said, hey, come back and reapply after the furlough period. She goes back, she reapplies, and one of their AI hiring tools told her that she had body language that didn't fit their hiring standards. So she already had the job, they told her to come back, but then they told her she wasn't qualified to be a makeup artist because of hand gestures she made during her interview. We already know that hiring and recruitment is already biased against women and people of color. So as much as we could probably advance and make this more efficient — are we doing it right, in a way we can trust? It's really important for us to take the time to make sure it's right.

Facial recognition software. Facial recognition software is very exciting. We all thought it was super cool when they rolled it out on our iPhones — we could open our phones, and we all love technology that's new and exciting. But for every way that you can use technology in a fun, exciting way, it can also be used in a way that could hurt people. People of color are already not treated fairly in the judicial system. A landmark study showed that 200 facial recognition algorithms had a harder time distinguishing non-white faces. These algorithms are used by law enforcement and border control, and are being integrated into many parts of the judicial system. Facial recognition software mislabeled 28 members of Congress as criminals, and it disproportionately affected African-American and Latino lawmakers.

We've already had a chatbot go racist and say anti-Semitic and racist things, and within 24 hours it had to be taken off of Twitter. We've watched algorithms built by private contractors rate Black people as higher-risk parole candidates. We've had the Google phone that couldn't even recognize darker skin, and we've had applications classify Black people as gorillas.

One of the most important parts of being in any business, and being in technology, is earning trust. We want to innovate fast, but not so fast that we lose the trust of our customers and the people that we want to rely on our products. The WHO — the World Health Organization — is already concerned about putting AI into medical diagnosis and different medical technology. I think a lot of what AI promises would, if we do it right, really help people, because the more we can diagnose, and the faster and more efficiently we can help people and get them into treatment, the better. But we already have such a distrust of our medical system. Look at how hard it was to get people vaccinated during COVID. If we lose their trust, they won't get the care that they need. So it's extremely important for us to build trust there and not just roll things out.

Some examples of medical bias. Did you know that women die at a higher rate from heart disease than men? It's because we have symptoms that are different, but for so long we only studied men. So women were going untreated and undiagnosed — we were letting women die because of the way we were diagnosing and recognizing it. Did you know that Black women have the highest maternal mortality rate — 69.9 deaths per 100,000 live births? And I am one of those women. I almost died, for six months after having my last son, with preeclampsia. I was like, I have to get out for my son's first day of school, and they were like, oh, no. Which is different, because Washington has really good medical care — if I was anywhere else, and if I had just left, I would have literally died. My blood pressure was like 170 over 116, which is insane. So this is a real-life problem. And if you look at the history, especially with people of color and distrust of the medical system — if we further that distrust, we will leave people not getting the proper care that they need. Trust is very important in people's relationship with the medical industry.

So we've already talked about the problems.
We've talked about where this can lead. But we all know that having a diverse workforce and diverse perspectives helps us have less data bias, a better return on investment, and better software. So what can you do in your company or on your team to make it better?

We can start with having a more inclusive work culture. It's cool to hire women and hire people of color, but if you make them uncomfortable, or you don't listen to them, or you don't have a place where people can work together and feel valued, what is the point of your hiring initiatives, right?

So, mentoring and sponsoring. I think that's the first one — one of the biggest things. If you see somebody new get hired, can you have a buddy system with them, mentor them, and help them learn? Maybe they're experienced engineers, but they don't know your company's policies. Maybe they're more junior and you can help them get to that next level. I think mentoring is a huge thing, for one. Also, people think of mentoring as a one-sided relationship. It's a two-sided relationship. Even if you're more senior, you're going to learn from the person that you're mentoring. It's a two-way relationship where you're both benefiting from the situation: you may be helping that person, but you're also learning more.

Especially as engineers, I think the longer we do something, the more we forget how hard it is to start. It's like when people write documentation: if you write documentation for something that you've been doing forever, you may miss the things somebody needs to know when they're first starting out, because you've been doing it for so long. When you get to mentor people who are more junior than you, and who have a different background than you, you learn not only how to be a better teacher, but how to scale better.
So you benefit just as much as the person that you're mentoring benefits.

Career planning. Not everybody has the same goals as you do, right? Not everybody has the same track or plan for their life that you do. So help them — but first ask them: what are your career goals? How can I be of assistance in that way? Especially if you're a manager. Good managers make a huge difference. People don't quit jobs, they quit managers. So it's really important to really know the people who work for you, on a personal level. You don't want to be nosy, but you know what I mean — see them as people, right?

And returnships and apprenticeships. Just look at my story — but I'll also tell you another story. I have a really awesome friend. She's a brilliant engineer, and her son, like my son, has autism. So she's the mom of a special-needs child. She had to drop out of technology for 14 years, and when she tried to go back and get a job again — and she has a great degree and a great career — she couldn't even find a job. And when she did, they made her an associate engineer, after she had been a senior engineer for years. If it wasn't for our apprenticeship — as an associate engineer, she was making 50 percent of our apprentice salary. She made more as an apprentice than she did as an associate engineer. That is offensive. That is ridiculous. So apprenticeships and returnships don't seem like a big deal, but it really makes me sad that they were one of the first things to get cut. We have gotten so many women back into the workforce through apprenticeships and returnships. I've seen them affect and change people's lives in real life.

Also, a subject that not a lot of people talk about: when you give women financial stability, they can leave domestic violence situations. It is so important to have financial freedom. And not just that — like I said, we volunteer with a lot of military spouses, and there are plenty of them whose husbands end up hurt and can't work after that, so they become the primary breadwinner. Giving women financial freedom can affect their lives in numerous ways.

Unbiased interview practices. Have you ever heard "culture fit"? Not everybody is going to be exactly the way you are, right? And that's what makes our teams better — we don't all have to be exactly the same. If you ask me Star Wars versus Star Trek, I'll have no idea what you're talking about, but I can talk to you about some other nerd stuff, right? So let people bring themselves to an interview, and get to know them and what they're working on. Also, a lot of our interview processes in tech are very gatekeeping and biased — let's be honest. The way that we do interviews has nothing to do with what we do in our regular jobs. They're gatekeeping ways to keep people out of tech, and if you really want to hire good people, really look at your interview processes and how you are gauging candidates.

Offer family-friendly work policies. This is going to seem crazy, but did you know that giving men longer paternity leave actually helps women? Because that means you're starting off establishing better caregiving habits earlier. If you give a woman six months of maternity leave and you give a man six weeks of paternity leave, you're already starting off with her doing more of the work at home, right? Did you know that the American economy is basically profitable on the back of women doing the labor at home? Sixty-seven percent of women are caregivers, right? So if you think about it, when you give men more family-friendly work policies, you're actually helping women too. Just having a family-friendly working environment really helps. Also, men in the audience: when you have to go pick up your kids, say so. You have no idea how much pressure it takes off of a woman — being able to say, hey, I have to go pick up my kids, or I have to do this family thing, and not being the only person who has to say that. It makes it that much more digestible, and you don't feel like you're always the person having to have that conversation.

Making open source more inclusive and accessible. It's really hard to get started in open source; it can be really intimidating. And sometimes, when you've been doing it for longer — if you go on social media, or Twitter or anything, there's always an argument about how to get started and whether it's accessible. And then people are like, well, we shouldn't let certain people contribute, because you need to be at this level. People are never going to be at a certain level if we don't let them try. It's like those job descriptions — have you ever seen a job description that's entry level, but they want 10 years of experience? And you're like, dude, where? How am I going to get that if you don't let me in and give me a chance, right?
So I think it goes back to remembering where we all started, right? And remembering that if we don't allow people in — especially in open source, which to me is the best form of software engineering, of just making software, because everybody's working from all over for a common goal, and usually you're working for free. How are you going to get a bunch of people to work for free if you don't let more people in? Everybody that you onboard and help means less work for you, you know what I mean? So really think about it, and pay it forward in a way that makes your life better and theirs at the same time.

Elevate women's voices. Have you ever been in a meeting and realized one person keeps getting talked over, over and over and over again? I have said things in meetings, and I have this one friend, and he'll be like, hey, did you hear what Autumn just said? And I'm like, over here, guys. So if you see somebody getting talked over constantly, or getting passed over for the cool project, or the woman is always the one taking notes in your meeting — sometimes if you say, hey, I'll take the notes, or, did you hear what she just said, or, that's the really cool idea Autumn mentioned beforehand — little things make a huge difference. And not only that, but when you make someone feel seen and heard, they're gonna work harder. When somebody constantly goes to work and feels like their opinions aren't valued, they're gonna stop speaking up. They're gonna stop contributing as much, because they just feel like they're not being heard.
So those little things are a big deal.

Acknowledge the whole person, and model bringing your whole self to work. Like I said about, hey, I have to go pick up my kids at 3:30 and I'll be unavailable — when a guy says that, it makes more room for when women have to do that too, you know? We're not the only ones, like, oh, she always has to take off to go pick up her kids. Life outside of work: you don't want to be nosy, but you do want to kind of get to know people, right? Like, hey, we both have cats, or we both really like, I don't know, anime. Building that bridge and treating your co-workers like they're people, getting to know each other on a human level, really helps us to be our whole selves at work — it helps you to feel like you belong, right? And that's really important.

Thank you all for coming, and thank you for being open to the ideas. Are there any questions? I'll come around with a microphone so that the people streaming can hear.

"I enjoyed your talk. You went into some detail about mentorships, but I wonder if you could expand on sponsorships."

Oh, okay, I love this. So I think sponsorships are something that not a lot of people talk about. The way that I explain it when I talk to college kids and people in early career is: your sponsor is almost like your PR team in rooms that you don't have access to, right? Junior people are not going to be in the same meetings as people who are more senior. So just the fact of you knowing that person is already going to elevate them — talking about them. Say there's a project, and you're like, this person would be perfect for that project, they're really interested in this area. But they're never in the meetings where those projects are discussed, or where those decisions are being made. If you mention their name, if you introduce them to people — these things seem so small, but you'd be so surprised.

Also, it's really interesting. A mentor relationship is a two-way relationship, right? Those two people are interacting. A sponsorship is like a three-way relationship: you have the sponsor, you have the person you're sponsoring, but then you have the audience that that person doesn't have access to. So it's a three-component relationship, and it's amazing. Sometimes people aren't the best at networking in technology, right? And you can connect them to people who could be other mentors — because sometimes you have more than one mentor, right? Maybe it's someone who isn't where you want to be in your career one day, but you both have things in common. We have a veterans organization, so maybe they're both veterans, maybe they're moms — different aspects of your life. So even if you're connecting them to someone else that you think could help further them, or help them feel like they belong — I think sponsorship is not only underrated, it's something most people don't even know about, and it can really bring a lot of value. I love your jacket, by the way.

"Hi, I'm Chloe, nice to meet you. You mentioned in your presentation suggestions for existing companies to implement to create an inclusive culture. You also mentioned how the biases within the interview process, and even after the interview process, differ among the different genders — where the men have most of the privilege and everyone else kind of doesn't. You also mentioned things about personal safety, and how some people rely on their income source to leave abuse, for instance, which is really important. So — I fit into most of these categories. When looking for a job and interviewing, what are, at least in your experience, suggestions you may have for someone to gauge an interviewing company?"

Oh, I love this question — for green flags, red flags, and yellow flags, et cetera. Okay, before I answer your question, I would love to relate this back to something else. When we look at diversity, people think this is some warm and fuzzy feeling. It's not. This is not about warm and fuzzy; this is about you making more money, drawing good talent, and making good software. There's nothing warm and fuzzy about attracting more diverse teams. When I go and look up your LinkedIn and the people who work there, if I see only one type of people, I'm going to run for the hills, right? It's a green flag when I see that you've hired people of different genders, different sexual orientations, different colors of skin. When I see a rainbow of people, I'm like, this is my tribe, you know what I mean?
So for one I look at the diversity I look once if somebody is like, uh, what is that one thing that one saying when people are like, this is a fast-paced environment and I'm like You know like because I will work hard like I took a data structured book in the labor And I talked to the p.m at microsoft and she had to tell me to get off the phone and go have a baby I will work so hard, but at the same time I like hanging out with my kids and being a normal human, right like That's okay like so I think looking for you have to first think about like what's important to you What is the life that you want to look for that you want to have right and then you find a job that within reason Is going to align with that we all have to pay our bills I'm a single mom of three kids and my Costco bill is ridiculous but I also want to see those kids and I want to be at the house that I pay for at some point so That balance, you know, but I think it's this market is hard, right? Like and sometimes we just have to do what we have to do But I hope that if enough people that are in influential places Can help us and see the value then they can make for a better environment, you know So I think kind of at this point We're hoping to get people that can make the right changes to listen Even though a lot of those slides have been said many many times before But hopefully we can see the value in them Hey, thanks for for bringing this talk You mentioned a couple different things in terms of like programs or organizations And then also talked about kind of people how they can step up Is is there anything in particular that you can point to that in general is more In your experience and the people that you've worked with have been Have had a greater impact. Is it is it generally individuals? Yes. Okay So there's two men on my team. I won't use any names, but they are my like rider diabetes, right? 
One is big and scary, and when people steal my meeting rooms he just stands behind me, and when people interrupt he's like, "you're not listening to her." Dude, being yourself, being a person, being a friend on a human level is important. Not feeling alone. Nobody wants to feel alone; nobody wants to show up to work every day feeling like that. Just being nice: Justin is the reason I'm here. We're friends, we met on Twitter, now we have a podcast together. Just being a nice human is so underrated. Help people. If you see someone struggling, if you see someone always being talked over: for instance, there's a girl at work who always sat by herself, and I was like, "do you want to get lunch with me?" You know what I mean? It blew my mind during COVID how mean people were to each other, and as a mom of a little kid I was like, I'm going to make you all watch Daniel Tiger at this point; I'm sending you all back to Elmo. Which is funny, but it's the most basic parts of being human that we forget. Just be kind to someone when they're having a hard time, because they'll teach you what they need. We're all different. I can't tell you what the next woman will need, because we're all different, but I guarantee you, if you befriend her and make an effort, like I said about bringing your whole self to work, she'll tell you what she needs and how you can make it better.

Let's do one more question, and I think we're at the end, but the speaker can always talk to y'all in the hall.

Okay, great, I'm the last question. Thank you so much for your talk.
I really appreciated it, and it resonates with me. (Thank you for coming.) I'm not in software at all; I'm actually a professor at Cal State LA. So I see the other end: I'm seeing students, the younger generations coming in, and I'm wondering what your perspective and advice would be on how to inspire and empower the younger generations to go into tech. I see a lot of students being disengaged. A lot of them have been really impacted by COVID; a lot of them are dragging their feet coming to class, and some of them don't even go to class. So I'm wondering, how do we light that fire under them so that they're interested in entering the tech world? I teach statistics, but I'm telling them, you know, data science is where it's at for a lot of jobs. But a lot of students, and specifically, as you mentioned, a lot of women of color (there are a lot of Latinas at Cal State LA), just think they aren't good at coding, or they just aren't good at math, so they can't do it.

I got a whole art degree because I was like, I'm not good at that stuff, math is really hard. Math is really just instructions that you have to follow, and processes. But you're told the whole time that math is too hard, girls aren't good at math, and you're just like, oh well, I'll just go get an art degree.
That'll be better, you know. So I think we can start with representation, because we all do things based on motivation, right? You get up and go to work because they're going to pay you. So put some Latinas or women of color in front of them. If you organize something, I'll Zoom in; call me. I volunteer with Rewriting the Code and talk to a lot of kids, and I think it's really cool to build relationships and say, "hey, this is where you could be," and do workshops and kind of make them believe in themselves. Representation one hundred percent matters. Seeing someone who looks like you gives you that buy-in that you could be there one day. That's why I always put my kids in my talks, because people are always like, "you can't be a mom and be an engineer at a big company," and I'm like, now I'm going to do it to spite you. So tell people and show them: hey, this could be you. I know it's going to be hard. You're not going to be perfect at it. Some days are going to suck. You might get a D in a class. It's okay; D's get degrees sometimes, you know what I mean? Show them that yes, it's not going to be easy, but it'll be worth it. I think that's the best thing, and I will find you people to talk. Actually, I know an engineer at Apple, and she's Latina and she's badass. I got you; she will definitely come talk. I think that's the best way.

Thank you guys so much. I didn't puke, be proud of me. Does this mean I can get more coffee now? Thank you so much. I was so scared. Thank you for coming. Oh, okay, I would love that. Thank you so much for coming to the talk.

Testing, one. Testing, one, two. I'm going to try to walk around, so you've got that little changer, right?
There was one in here yesterday. Okay, I think this just turns the screen on and off. All right, I'll walk around up here.

How many of y'all, is this the first talk you made it to this morning? A couple of you. Did a little sleeping in, got caught in traffic, stopped for coffee, maybe gas. How many of you, is this your first SCALE? Oh, we have a few. How many of you are SoCal locals? Big group. Who thinks they traveled the farthest to be here? Where are you from? Shout it out. Ten minutes away? No. All right, go for it. South of Fresno, that's what, three or four hours. Texas. Anybody beat Texas? We've got somebody from DC. We had somebody from France in the last talk. All right, it looks like DC is the farthest traveler in this session. We've got about five minutes till start.

We actually don't have a class yet on this. I know, it kind of fits with natural language processing, the NLP stuff, but it's gone beyond that, so all the stuff that was valid like three years ago... Well, I hope maybe to shed some light on that. It is crazy stuff. It's swirling around in multidimensional space; that's where it lives.

Two-minute warning until start. Two minutes. Countdown.

Okay, this is 12:30, SCALE 21x, Saturday, sunny Pasadena. We have "How Open Is Open? Transparency and Accountability in Open Source LLMs." We have Frank Coyle here, professor at Southern Methodist University, specializing in AI from all the way back in the 1980s, so we should have some good perspective. Please give a warm welcome to Frank Coyle.
Thank you very much; glad to be here. I was outside a few minutes ago approaching SCALE and I saw huge crowds of people, and I thought, wow, this talk is going to be really popular. But it was the TV show filming out there, a lot of action. I'm glad to be here, and I want to give you a perspective on large language models from an open source, open viewpoint.

I wanted to start off with a prologue. I was thinking about this standard idea that large language models are taking over the world and are involved in all aspects of society. So I thought I'd have some fun, and I like to have fun if I can. I decided to ask DALL-E, the generative AI, to draw me an illustration of how large language models are being used in all aspects of modern society. You see what DALL-E does: it has trouble with spelling. There I am asking my question in some space, and here is what DALL-E thinks the answer is. Kind of freaky looking. If you're afraid of large language models, this might make you afraid. But there's an aspect of this that I like, which we'll look at later: we're in this three-dimensional space, and there's this aspect of dimensionality that connects with large language models.

The driving force for me to make this talk and come out here (I've been to a few SCALEs before, when it was in Los Angeles) was an article I found that I think will be of interest to many of you, and I just wanted to share it.
It's called "Opening up ChatGPT: tracking openness, transparency, and accountability," and I've got a link to the article itself. Just some quotes: "The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field." So there's this concept of open source, but open source doesn't really capture what openness means for these large language models. What they did was investigate a huge number of large language models and categorize them on various aspects of openness.

This led me to think about how to put this talk together. There's a really interesting graph in the article. It's hard to see a lot of the details, but they basically break down a whole host of large language models and rate the openness of the code, of the data, of the weights (we'll talk about that), a whole variety of things. They categorize them: green is very open, yellow is marginally open, and red is not very open at all. And you see ChatGPT is red, red, red, red, red, all the way across. There's a link to a website where they keep this updated, so you may want to take a look at that; I've got a link toward the end of the talk as well.

So what I want you to think about is: what does open mean for large language models? We have the source code, and the source code is simply the algorithm for the neural network that creates these connections in multi-dimensional space.
We know about gradient descent; gradient descent is the algorithm that the neural network uses. But the real question is: what aspects of large language models pertain to openness as we want to think about it? I've broken this down into a few areas. What I want to do here is take a shallow dive into the technological underpinnings of large language models, and again I asked DALL-E for a shallow dive picture, so here's DALL-E giving me a shallow dive.

They spell it DALL-E, which is a takeoff on Salvador Dalí, the surrealist painter. A little backstory here: when I was an undergraduate at Fordham University, Salvador Dalí came to New York City, and I was fortunate enough to be in the arena where he came to give a talk. People waited an hour, an hour and a half, before Dalí arrived. The place was packed, waiting to see Salvador. Finally Salvador Dalí arrives, wearing this cape and jewels, carrying a stick, and comes in to great fanfare. I'm not sure what he said; the only thing I remember was a question from the audience: "Mr. Dalí, what is your conception of heaven?" He said, "Oh, heaven. Heaven is mountains and mountains of Camembert cheese." I thought that was a very Salvador Dalí-esque comment, so whenever I speak to DALL-E, I always think of Salvador.

So I want to take a shallow dive into the technology behind it, because I think it will help you understand what the components are and how each plays into the question of openness. What do we need to create a large language model? Of course I had to ask DALL-E: show me lots of data, truckloads of data.
So here's this crazy illustration of a truck with lots of data. We know we need data, and data is the driving force behind the success of these large language models. Now what else do we need? Now we're looking at some of the technological underpinnings. All this data, web pages, Wikipedia, all these articles fed into the large language model, is basically broken down into tokens, little chunks of text. Typically they're words, with some variations we'll talk about later. You take some words: cat, dog, happy. It takes those words and creates an embedding. An embedding is simply a vector of numbers. Every word becomes a vector of numbers, and we'll look at this in a minute, but here's an example. The first GPT took every token and turned it into 768 floating-point numbers, and depending on your large language model, some have 768, some have 1024: huge vectors. It's a lot of information. Every word is one of these vectors. Then these embeddings become the input to the neural network, and they provide a significant amount of information about the text being read. So we've got data, we've got tokenization, we've got the embeddings (giant vectors of floating-point numbers), and then we have the neural networks.

Here is what an embedding looks like. I like to think that the embeddings are really the secret sauce behind large language models: these giant vectors of numbers. Take the word "scale," for example. This is a 100-element vector, a list of floating-point numbers. It's actually smaller than what's being used; I just found an example that would fit on the screen, but you've got maybe a thousand of these numbers for one token, for one word. So what does the embedding capture?
Why do we need this much information to keep track of one word? Well, over time it learns the syntax of the word: is this word a noun, is it a verb, is it an adjective? And it learns this by looking at examples of the text. It also learns the semantics: what does this word mean? What is the context of this word? Take the word "scale." You walk outside the building, you see the big sign, SCALE; we see it and know that this is the conference. But a scale is also something you use to weigh objects; it's a way to talk about measurement. The embedding captures this information relative to the text, so when you ask it something about SCALE in Pasadena, it will understand, in many cases, that you're not talking about something used for weighing, but about the conference. It's able to distinguish the semantics.

It also captures word co-occurrences. The basic job of the large language model is language, and it predicts the next word of your sentence. You ask it a question, and it starts to come up with the words that best answer that question, and what other words go together in this context. It also understands relationships and analogies, and can deal with hierarchical information: it builds these hierarchies of understanding. If you're talking about SCALE the conference, it will know that in this context SCALE is a kind of conference, one of the conferences that occur in the world.

This is amazing. We don't really know how this actually happens under the hood; we know that it happens, and I think it surprised a lot of people when they started fooling around with this technology and, bingo, all of a sudden things started happening. Now, it does this by taking those vectors and feeding them into a neural network. Over time, as text keeps coming in, those embeddings get fine-tuned.
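The front of that pipeline, text to tokens to embedding vectors fed into the network, can be sketched with toy values. Everything here is invented for illustration (the tiny vocabulary, the whitespace tokenizer, the 8-dimensional vectors); real models use learned sub-word tokenizers and vectors of 768 or more dimensions.

```python
import random

random.seed(0)
EMBED_DIM = 8  # real LLMs use 768, 1024, 4096, ...

vocab = ["the", "cat", "dog", "happy", "sat"]
# Before training, every embedding starts out as random numbers;
# training gradually turns these numbers into "knowledge".
embeddings = {w: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
              for w in vocab}

def tokenize(text):
    # Trivial whitespace tokenizer; real schemes (e.g. BPE) split sub-words.
    return text.lower().split()

def embed(text):
    # Map each token to its vector: this is the input the network sees.
    return [embeddings[tok] for tok in tokenize(text)]

vectors = embed("the happy cat")
print(len(vectors), len(vectors[0]))  # 3 tokens, each an 8-number vector
```

The point of the sketch is only the shape of the data: a sentence becomes a short list of long numeric vectors before the network ever sees it.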
They get modified. When the process begins, those embeddings, those vectors, are random numbers, just random numbers with no meaning, and over time those numbers become knowledge. Those numbers encode the syntax, the semantics, and the relationships that are part of the text being fed in. This is done by training with terabytes of data, which is a huge energy consumer and requires tremendous compute power to generate these embedding models.

Here's a very simple example of how something like this works. On the left side you see our words, our tokens. There's an embedding vector; it's only seven elements here, but expand that to seven hundred or a thousand. What's really crazy is that the vector is a representation of a multi-dimensional space. Now, we are humans; we live in three dimensions. We understand three dimensions. Four dimensions? People say maybe time is the fourth dimension, but we don't have a picture of four dimensions. So what is the picture of a thousand dimensions? We cannot grasp what that really means, but this is where the action is happening: in this multi-dimensional space these terms are being embedded.

You can do something called dimensionality reduction: take a thousand-dimensional space and collapse it into two dimensions. You lose some information, but basically what you're seeing here is that words like "puppy," "dog," and "cat" appear close together in this multi-dimensional space. "Woman" and "man" are closer to each other in the multi-dimensional space than to "puppy" and "dog," okay?
So terms that go together are close together, and that's essentially how large language models work. When you enter a query into a large language model, it tries to find things that are close to that query, to the words that you use in the query, and again, it is a language-generating machine.

Now, GPT-1 had an embedding vector of 768. GPT-3 goes up to 4,096 floating-point numbers for one token. And GPT-4? Not revealed; OpenAI doesn't want to give the store away. Not that it would help us much to know the size of that vector, because the computational resources needed are huge, and that's why of course the GPUs are so hot and so successful, and why NVIDIA is conquering the stock market with its prices. By the way, over in the exhibit hall they have a GPU-driven machine with eight racks of GPUs. Go over and look at it; it weighs 250 pounds, and I'm not sure how much it sells for, several thousand dollars at least. It's at the Meta booth.

So, one of the things to think about when we think of openness: you've got these neural networks. We have the algorithms for the neural networks, but we don't know how they're organized.
If you know anything about neural networks, they're basically nodes that are interconnected, sort of modeled on the human brain, which has neurons that are interconnected. What they don't tell you is how many hidden layers there are. Hidden layers are the layers between the input and the final output, and they determine the dimensionality of the space the neural network can live in. How many hidden layers does a given neural network have? That would be useful information if you wanted to reproduce what the large language model does. What is the size of each layer? Not only the number of layers, but the number of nodes in those layers, is a completely free variable. There are a variety of standard algorithms, but then there are choices made inside the neural network: things like activation functions, optimization, and dropout. One of the downsides of a neural network is that it can learn so well that it memorizes all the information that comes in. It's like a student who memorizes everything but understands nothing. So there's this concept of dropout: every once in a while you pull a neuron out so the network doesn't learn too specifically and can generalize.

We do know that GPT-3, as advertised, has 175 billion parameters. When you see these numbers after a model name, millions, billions, what does that mean? The parameters are the weights of the neural network, the numeric values on the lines that connect one neuron to another. All the neurons in one layer are connected to all of those in the next, and the next, and these connections are the weights. So when you hear the term parameters, that's what it means.
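Counting the weights in a toy fully-connected network shows where these headline numbers come from. The layer sizes below are invented for illustration (and bias terms are ignored); GPT-3's 175 billion comes from the same kind of counting at vastly larger scale.

```python
def count_weights(layer_sizes):
    # Each node in one layer connects to each node in the next layer,
    # so a pair of adjacent layers contributes n_in * n_out weights
    # (bias terms omitted for simplicity).
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# A made-up network: 768 inputs, two hidden layers of 1024, 768 outputs.
print(count_weights([768, 1024, 1024, 768]))  # 2621440
```

Even this tiny four-layer sketch already has over 2.6 million weights, which is why drawing 175 billion of them is hopeless.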
It's the number of lines connecting the nodes of the neural network. Now imagine trying to draw a neural network with 175 billion lines connecting the nodes; that gives you an idea of how complex these networks are, and I can't wrap my head around 175 billion parameters.

So, trying to summarize our dive: in addition to the data being used, there's openness in terms of tokenization. Let me go back to that concept of taking text and breaking it up into individual tokens. It's easy to think of the tokens as words, but in these large language model algorithms they're not always words. They can be parts of words; they could be a fixed number of bytes. So what is the tokenization scheme? Does the large language model give you its tokenization scheme? The scheme is going to affect the end product. Then there's the embedding: each token is going to be mapped to a high-dimensional vector. What algorithm is used to do that embedding, and what is the size of the embedding? These variables are not fixed in any way, so this is also part of what you may want to ask when you ask about the openness of your large language model. And then there's the neural network, which we just looked at, which is configurable in so many ways. The design of the neural network is, for GPT, part of the hidden information: how it does what it does, the number of hidden layers, the number of nodes per layer. This is the information you want if you want a truly open system.

That ends our shallow dive. It takes me a while to get these pictures to be what I want. I had the guy going in, so I said, all right, give me a picture of the guy coming out of the water, just show his head, and I got this guy looking like a frogman coming out. No, no, I don't want a frogman.
I want nothing on your face, nothing on your head, and you have to be very specific: I don't want anything on your head, no mask. I could not get it there. The best I could do was have him with goggles. DALL-E just insisted: you've got to have something on your head when you're diving in the water. Okay.

So I'm going to take a little shift in our conversation and talk about European lawmakers recently passing the AI Act, the world's first comprehensive AI law. I'm sure you may have heard of this; it was a big development. Their basic focus was that technology should be human-centric, not just technology for technology's sake. It was passed by the European Parliament and is going to be confirmed, presumably, by the EU member states. They divide applications into three categories, and the one that's probably most important to most folks is what they call high-risk applications: banking systems, food-related applications, things that are critical to people's lives. If you're building these applications, there must be human supervision in the loop; these things cannot run autonomously. There must also be a right for people to question why these answers are coming up, why these decisions are being made. And the data used to create the model must be published, because it's the data that determines what the output of these models is going to be. The fines are pretty hefty: up to 35 million euros are possible.

Now, you may ask: what's happening here? Where are our laws? Where's our control?
Well, I asked DALL-E for a little help in describing this. We have two forces at play here. The AI Alliance is a consortium of companies teaming up with academia to promote open alternatives to OpenAI. Meta is a huge player in this, and IBM; they were the founders. The National Science Foundation has joined in, and the Linux Foundation has joined in. Good news for the open source community: Hugging Face, which I'll talk about in a minute, a wonderful website, and over 50 companies. They invited Apple, Amazon, Google, and OpenAI, makers of ChatGPT. What did they say? No, no, we don't really want to be too open. We are concerned with safety; safe development is what we want and what we need. So they created their alternative group, the Frontier Model Forum. DALL-E had a little trouble spelling "Frontier Model Forum," even though I put it in caps and quotes; I got "Frontier Made," but it did get "AI Alliance" correct, so that's good. The distinction is interesting: one side says open, but the Frontier Model Forum talks about safety, safe development, and laws to control safe development. I'm not sure what that is going to turn out to be.

Okay, so Hugging Face. How many of you are familiar with Hugging Face at all? Okay, great, we've got a lot of Hugging Face users here. Hugging Face is considered by many the GitHub of machine learning. There's an extensive library of 300,000 models, accessing a broad array of collections of data uploaded by the community, and it's a great site to visit. It's got Spaces, where users can create interactive browser demos; you can try out machine learning models without needing a lot of technology. It's big on open source, accessibility, and deployment tools. You can get lists of a variety of models, with cards that give you the background of the data, and you can get datasets.
You can download models that have already been trained on natural language. It's a great resource. Now, what are the advantages of these open models? Here's some terminology you'll run into: the concept of fine-tuning. Fine-tuning means you download your model, and it has to be an open model so you can change the weights. You're basically going to be changing the weights of the neural network that created these embeddings; you're modifying those vectors. You download your model with the embeddings already created by training, and then you can use your own domain data. You don't have to give your data to OpenAI; you can use it to continue training the neural network, so you make the model a little smarter about your data. The embeddings then become tailored to your own use cases, and you get more accurate responses. One way to think of this is that you're running your data locally on your machine, and you're using the large language model to write syntactically correct responses back to you.

Now, there's another option besides fine-tuning, and that is something called RAG, retrieval-augmented generation. In this approach you don't change the weights of the model. What you're really doing is using the large language model for its language capabilities, and feeding it the data you want it to digest and answer questions about. I actually use this a lot in my classes, and it's very useful. The first example is useful when you're speaking to the large language model: "For the text enclosed in triple quotes, give me five bullet points that I can use in a PowerPoint presentation." Let's say you have to do a presentation and you have four, five, six articles. You take these
articles, feed them in as part of your query, and let the large language model know what data you want it to digest and what your question is. If you don't separate those, you're not sure what's going to happen. That's been very effective. My favorite, which saves me a huge amount of time, is making up multiple-choice tests, and it's really good at it: "For the text enclosed in triple quotes, give me 10 multiple-choice questions and put an asterisk next to the correct answer." It really does a good job; it creates reasonably sensible alternatives to the correct answer. Maybe out of 20 questions I have to correct one. You don't want to just let it loose, because then the students are on my case afterwards.

And then: "do not hallucinate." That's a very useful phrase to use when you're asking a question. Do not hallucinate; if you don't know the answer, don't say anything, or say "I don't know." It will generally respond to that. So what is a hallucination? If you think of this multi-dimensional space we're trying to visualize, a hallucination is simply a choice that's the closest thing to whatever your question might be; if the information is not in the multi-dimensional space, it'll give you something close. If you go back to the slide with "man" and "woman," the closest item might be "dog." And I go, my husband is not a dog. So it's not completely hallucinating; it's simply finding the closest match it can find.

If you use an API, if you programmatically go in and use the API, for example for ChatGPT, you can specify: I want to see the three, the five top alternatives, and it gives you a probability-of-correctness score.
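The triple-quote pattern described above, keeping the instruction clearly separated from the pasted-in source text, can be sketched as plain prompt construction. Everything here is illustrative: `build_prompt` is a hypothetical helper, not part of any real client library, and the actual call to a model is left out on purpose.

```python
def build_prompt(instruction, source_text):
    # Keep the instruction and the source text visibly separated so the
    # model knows which part to digest and which part is the question.
    return (
        instruction + "\n"
        + '"""\n' + source_text + '\n"""\n'
        + "Do not hallucinate: if the answer is not in the text above, "
          "say you don't know."
    )

prompt = build_prompt(
    "For the text enclosed in triple quotes, give me five bullet points "
    "that I can use in a PowerPoint presentation.",
    "...article text pasted here...",
)
print(prompt)
```

This is the essence of RAG without the retrieval machinery: the model's weights are untouched, and your own documents ride along inside the query.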
So you can see how close things are. When you use the straight interface on the web, it gives you its top choice, but through the API you can say, "I want to see the top five," and then look at them and see whether something else might be more relevant.

I came to SCALE on Thursday, and I was fortunate enough (this is a little postscript to our talk) to catch the Zoom talk by Bruce Schneier at the Kwaai AI conference, and if you'd been there, it was absolutely fabulous. A little background if you don't know Bruce Schneier: he's one of my heroes. I just wrote some things down at the talk. This is from Bruce: LLMs are not your friend. They may sound really friendly and chatty, but they're not your friend; they are just algorithms. The problem is that their language is seductive. They sound like us, and as he points out, we tend to trust those who sound like us. You come from New York and hear somebody from New York, there's immediately a connection; somebody from Texas, wherever you're from, a dialect: "Oh, how you doing? You're my pal," right? I thought that was an interesting takeaway. The other point: they act as our agents, but they're double agents. They're going to help you, but they're also created by a company whose objective is to make money, and their ultimate goal is to make money. They'll help you (they're not making money when I'm asking for multiple-choice questions), but they do have their own agenda.

If you don't know about Bruce Schneier's monthly newsletter, subscribe; it's free. Once a month he sends his newsletter out by email, and it has really great stuff. It's called Crypto-Gram. His original work is in security.
He was one of the developers of Blowfish, which was one of the major encryption algorithms, and now he's migrating from the world of security; well, not migrating, but the worlds of security and AI and large language models are now becoming, in a way, part of the same deal.

So this is from his most recent newsletter, and it's called "Microsoft Is Spying on Users of Its AI Tools." Microsoft announced that it caught Chinese, Russian, and Iranian hackers using its AI tools, presumably coding tools, to improve their hacking ability. That was the announcement: "In collaboration with OpenAI, we are sharing threat intelligence showing detected state-affiliated adversaries, tracked as Forest Blizzard, Emerald Sleet, Crimson Sandstorm," and all these wild group names, conducting cyber operations. As Bruce says, the only way Microsoft and OpenAI would know this is by spying on chatbot sessions. I'm sure the terms of service, if I bothered to even read them, give them that permission, and of course it's no surprise that Microsoft and OpenAI and everyone else are spying on our usage of AI, but this confirms it. Microsoft lets us know.

So, this is the reference to the article that I presented at the beginning, "Opening up ChatGPT." It's from the ACM, and there's a link to the website on GitHub; look at their latest stuff. And from DALL·E: that's all, folks!

Any questions? If you would like a copy of my slides, there's a QR code that will take you to the website. If you have questions, I'll bring the microphone around so the people streaming will be able to hear.

Question: With respect to the openness of different stages of these models, especially if we go back to when the models were pre-trained (we've been talking about fine-tuning, but many of these models are open-sourced, pre-trained by bigger companies), the question I have is: how much of Hugging Face is truly open? Because the pre-trained model is given to us open source.
How do you validate what has been redacted from it? Nobody can look into these models to figure out what's in there. So we start at a wrong point, and we are heading towards even more. We are trying to correct it, but will we be able to absolutely correct it?

Yeah, that's an excellent question, right? Because these models that have been pre-trained have absorbed that data, and it's underlying, and you can't go back; you can't open up the door. Right, so it's "open," but there's that factor, and maybe the only way is to take the datasets that Hugging Face has, look at them, and then use those to build your own large language model, which is expensive. But for certain applications one can imagine that in restricted domains (banking, telecommunications, health care), if you have the money, you could do it. Yeah. Right. Yeah.

Question: My question is regarding your slide showing that GPT-3 used 175 billion parameters and GPT-4 is using on the order of a trillion parameters. If we take one parameter in floating-point 16 representation, that's two bytes for the parameter, two bytes for the gradient, and around twelve bytes for the optimizer, so each parameter takes around 16 bytes. So one billion parameters is around 16 gigabytes of memory. But if you have to do that for 175 billion, you are looking at terabytes of data. So most of us in academics are priced out; we can't do GPT-4-like work anymore. I mean, machine learning we could afford to do with supervised learning, but with generative AI, just because of the computing power, you need hundreds of NVIDIA A100s to distribute the memory and compute. So how can academic people do anything in this area?
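The questioner's arithmetic checks out as an order-of-magnitude estimate. A quick sketch, assuming fp16 weights and gradients plus roughly 12 bytes of optimizer state per parameter, as stated:

```python
# Rough training-memory estimate: fp16 weight (2 B) + fp16 gradient (2 B)
# + optimizer state (~12 B, e.g. Adam) = ~16 bytes per parameter.
BYTES_PER_PARAM = 2 + 2 + 12

def training_memory_gb(n_params: float) -> float:
    """Approximate training-memory footprint in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM / 1e9

print(training_memory_gb(1e9))    # 1B parameters   -> ~16 GB
print(training_memory_gb(175e9))  # 175B (GPT-3)    -> ~2800 GB, i.e. ~2.8 TB
```

At roughly 80 GB per NVIDIA A100, 2.8 TB of training state alone already implies dozens of GPUs before counting activations, which is the questioner's point.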
That's my question. Yeah, well, to paraphrase your question: how can academics even come close to building these models, given the limited resources that they have? And again, it depends on money. For example, at SMU we have a high-performance computing system, and some work could be done with that. Maybe the answer is going to be consortiums of academia and Meta and IBM and some of the big players getting together to try to develop their own models like this. But it's very difficult for an isolated university to compete in any way with this. No, you're right; it's an issue.

Question: Thank you, that was a really informative talk. I was struck by something you said in the early part, and then later we saw the Microsoft and OpenAI collaboration (and why wouldn't they? That's functionally the same company now). They had to use their own technology to mine the chatbot sessions, and they did this for a good reason. Earlier we talked about how wonderful it is that the EU is insisting that there be human supervision of these models. My question, or maybe there is no answer, is: on the scale of petabytes that we were just discussing, how can we hope to actually supervise the actions of these models when the models are so huge? You said yourself, beyond human comprehension. Where is the human going to fit in that loop?

Yeah, but at least they're saying: don't let them run completely amok; have somebody taking a look, asking the questions. Right, it may be a fool's errand, but at least they're trying to get a handle on it. Yeah. Okay, no feedback. Okay.

Question: Fantastic talk. Thank you.
The question I've got is this: the Open Source Initiative has been working on the Open Source AI Definition, including breakdowns of what you're describing, like which components need to be open for something to be considered open source, or open by nature, right? But when I talk to people about the need for these models to be truly open, not faux open, fake open, the pushback I get is: "I'm never going to be so big that whatever limitation the Llama model's license has is going to catch me, so it doesn't matter to me." They don't see that there's a problem with it not being open, and I'm having some difficulty articulating the real risks: what's going to happen in the future that you can't undo because you are working with a non-open something? Not that you'll have to pay more money, but there might be bugs or security issues. I can think about it from a software perspective, but I'm having a hard time articulating the real risks of not being open in the future, if you're trying to do something open source while using non-open models.

Yeah, so you're asking what can be done with an open source model in terms of ensuring security and fairness and so forth. You make a good point. There are terabytes of data; are you going to go through the data and try to find the points that are discriminatory or biased? And when you think about bias and discrimination, if we have more data that says X, then we need more data to balance out the bias. I'm not sure I have the answer to that question, but at least people would be taking a look at it; at least there would be the possibility of writing software to mine the data in some way, if we have access to it. But we have no access to it.
It's just a black hole, a black box.

Question: Hi, Kwaai fam. Thank you for the presentation; it's very clear. I just have a different interpretation of it. I work on reasoning AI. The first point is: Alan Turing was an amazing person, exceptional, but his test is misleading, because what looks the same doesn't mean it is the same. A large language model looks the same as a human reasoning, but it isn't. We should call it a "large word-association system," because there is no language in it: it doesn't understand grammar. Even though if you ask it, it's going to answer you, and it looks like it understands grammar, that's because it goes to the text that explains grammar; it doesn't understand grammar as we understand grammar. And if you work at the knowledge level instead, you solve a lot of these problems. One is the computational one, because knowledge is the most powerful way to compress data; we humans do not process data that way. Second, in terms of understandability and trust: you can trust somebody if that person can explain what's going on based on concepts. The system here doesn't explain how it gets there; it just makes associations. And finally, have you noticed that every AI-generated picture looks the same? We can recognize them in a second. Why? Because they're based on the statistical distribution. So it's not just that it's a double agent; don't think that this thing understands anything.

Right. I mean, it's so different from the early days of AI and expert systems, where you had rules and you could encode knowledge in some way. This is just a word machine, but one that lives in this multi-dimensional space and does things we have no idea how it does.
I guess one of the analogies I like to think about is in the AI world of games: training a neural network, for example, to play chess and Go. If you know the history of Go: Lee Sedol, the world champion Go player, played against the machine, and the machine beat him, making moves that no Go master had ever even imagined. The same is happening with chess. You let one of these neural-network chess machines loose against top players, and it will defeat them, making moves and sacrifices that no chess master would make, because it's operating in this other-dimensional world where it sees things that we have no idea about and no explanations for. And I think the same thing may be happening in large language models: they're living in a space that we cannot even picture. I can't even picture five dimensions, much less 4,096 dimensions.

But yes, all right. Do we want to take a few more questions, or do we want to do hallway questions? How are you feeling? I'm feeling okay; if anybody has another question, I'm happy to take it. We've got a question over here.

Question: I'll try to be quick; I don't think that's even possible. So, when it comes to the regulatory part: regulations, for the most part, are constraints, constraints on what you can do. Where is the injection point for a constraint or some kind of limitation in the world of LLMs? Do you try to put it in when the actual training is occurring? In the tokenization part? In the neural network part? How does that work when the model itself is designed to be creative and hallucinate, while regulations, as a counterpoint, are trying to be constricting? Where does that fit, I guess?
Yeah, you know: "do not hallucinate." Maybe that's the closest we can come. If you don't know the answer, if you can't give me an accuracy of, say, 0.9 or more, then don't answer the question; don't give me an answer and lead me down a path. But again, you get probabilistic here, and it's tricky. There's no answer; I have no answer.

We've got one more question here, and then I'm happy to chat out in the hall.

Question: As all of these tools are used to generate more and more data, what happens when the tools are trained on their own data? The internet is becoming images and text where everything is AI-generated. What happens, and how do you counter that, when you're training your models with data actually generated from the models?

Yeah, the whole generative AI thing: you can create variations of a text, variations of a painting, and feed them back in, and there's no way to prevent that. So I don't know the answer to that, but there's a related development in the video-generation world. I'm blocking on the name... Sora, okay. There was an article asking: how did Sora get so good at making these videos? And one of the answers might be that it was using a game engine with very realistic graphics (Unreal Engine, I believe): you can say "I want a video of an apocalyptic world with individuals," and it was using that to generate images, generate movies, and then feeding those in and training Sora on those generated images. So maybe that whole thing can work both ways. I don't know. One more question.

Can you repeat the question for the streamers? Yeah, so I believe the question is: how do you handle the fact that these models don't forget? They don't forget, but what does that mean?
It means that their weights are set to give you certain answers, and maybe the answer is that you can feed in additional information to offset that. And their remembrance is a little fuzzy, like our memory may be a little fuzzy about what happened to us this time last year, because of all the other data coming in. That's not a great answer, but it's the best I can come up with. Because what you're really doing in fine-tuning is giving the model additional information to change those weights, and the bigger the model, the harder it is to fine-tune; a smaller model you can more easily push in the direction that you want. Yeah, it's in there; it's in that embedding, and all you can do is push those numbers in some direction.

One last question, and then I think we're at our time.

Question: Hi, great talk, thank you so much. My question is: open source has generally helped other open source projects because there was sharing of knowledge. For open source models, since what you're saying is that the secret sauce is the data they're trained on, is there a way to have the scientific community or the education community share the datasets? I know there are datasets on Hugging Face. Is that the answer to allowing open source models to learn from each other and share knowledge, so that we can progress?

I think it might be. In fact, there are some frameworks that involve agents, like AutoGen, where you can have agents working together in a collaborative way, and those agents could be connected to other large language models that have certain datasets. Maybe that way you get more collaboration, bringing in other groups to collaborate in some way. That's the best I can say in the very mysterious world that we live in.

A round of applause, please, for Frank. Thanks very much. I'm sure he's happy to talk more on the floor, in the hallway. We've got lunch for an hour. Okay, or maybe
not.

How are you doing? Oh, thank you. Well, I heard your talk the other day. Well, thank you very much. Oh, thank you. Can you unplug me? For example... it doesn't really... that's crazy, it doesn't understand. But it may. Right. Right, it's based on your experiences and your digesting, and our brain is made for language, to understand language.

Everybody got something to eat? Anybody ready for a nap? All right, let's try to stay awake for one hour. Do we have any first-timers? Oh, a pretty good number of first-timers. What about SoCal locals? Who thinks they've traveled the farthest to be here today? Shout it out. From where? Panama. All right, you've got that beat; we'll have to get out a globe. Poland and Panama. We had one from France two talks ago, and the farthest in between was Washington, DC. Okay.

How does the mic sound? Can you hear me? Is it close enough? Should I be closer? Closer. Up close and personal. And then you can use that, but you can give us the two-minute warning.

While we're waiting for our two minutes, audience quizzes: how many of you are familiar with the CHAOSS project already? Okay, a few. How many of you are here more because you're into data science? Okay. How many of you actively contribute to open source projects? Okay. How many of you use something to measure the health of your open source projects? Okay, cool. I think you're in the right place. This is good. Awesome.

Thanks, everybody, for coming to our talk. I am... technical difficulties... all right, there we go. All right, so I'm Dawn Foster. I'm Director of Data Science for the CHAOSS project, and I've been doing this open source thing for 20-plus years.
So a really long time. I've worked at companies like Intel and VMware doing open source. I also have some other roles outside of the CHAOSS project: I'm co-chair of the Contributor Strategy Technical Advisory Group under the CNCF, I'm a board member of OpenUK, and I'm also on the board of the CHAOSS project.

Hi, I'm Cali Dolfi. I'm a senior data scientist at Red Hat, and I work in the Open Source Program Office, where I lead a project that focuses on doing data science on open source communities. Big surprise.

Awesome. All right, so: the data tsunami. One of the biggest challenges people have with data is that it can be overwhelming, and this is especially true when you're looking at metrics for open source projects, where all of the data is available to gather and analyze. Too often I see people measuring what's easy: easy to count, easy to understand, things like stars and forks, which don't really tell you much about the health of your open source project. Quite a few metrics tools, including the ones we have in the CHAOSS project, display what I affectionately call a "wall of metrics": pages and pages of visualizations that you can use to understand your open source project. They're great, but they require you to know what you're looking at, and where to focus your energy to get the data you need. We have visualizations for dozens and dozens of metrics, and all of them can be super useful, but which ones you should focus on
depends on what you want to achieve. Ideally, you should be looking at visualizations that help you find areas you can improve, or ways to measure whether or not you're achieving your goals as a project.

Within the CHAOSS project, we tend to see metrics being used for two primary purposes. Historically, CHAOSS has been a bit more focused on what I call contribution metrics (the box at the top). These metrics are designed to help you better understand the projects you're participating in, whether those are open source projects you run or open source projects you just contribute to. We have a Starter Project Health metrics model, a collection of four very simple metrics to help you get started measuring contribution, and we have a whole bunch of other metrics models on similar topics to help you dig in for more details.

Lately we've also been focusing on consumption metrics, the second type (the box at the bottom). These metrics are mostly related to inbound, or downstream, consumption of open source software: the software you use within your projects, your products, maybe your infrastructure. We have a starter viability metrics model with six metrics. These are models that Gary White at Verizon developed for us. I told him the Starter Project Health metrics model had to have four metrics, and he made his six; he just couldn't cut it down to four. Everything was too important. Which, to be fair,
it's kind of hard. These are used to measure the viability of the open source projects that you consume. Gary also developed four complete viability metrics models, with a whole bunch of metrics developed recently to help you gather more information about the projects you're consuming. These collections of metrics are organized into metrics models, which can all be found at the link at the bottom of the slide, to help you think about how you might implement a set of metrics together to understand some aspect of open source project health.

But a big part of using data is figuring out what questions you have, and how you can take this tsunami of data and figure out which metrics can help you turn it into something more meaningful. So in the next section, Cali is going to talk about how to do this using approaches from data science, and what she's learned from building the 8Knot visualization interface for CHAOSS's Augur software. Take it away, Cali.

Awesome, thank you. So I'm first going to take this from a general data science perspective. When people talk about data science, you're going to see some version of a graph like this, showing what the workflow would be. It always starts with figuring out your problem space, finding your data, cleaning your data, doing the analysis, and then whatever the reporting or deployment model would be. And for us, when we think about open source data: why would we want to reinvent the wheel?
Literally, if you look at the second graph over there: every time we want to ask a new question about our communities, it's very time-intensive, and that really starts to slow down the process and limit your capabilities around the analysis you do on your communities. When you start to cut down that time, that's when you can start to evolve and make your visualizations a lot more complex and specific to the communities you're looking at.

So first, let's talk about this at a conceptual level, putting the data aside for a second. You have to start by thinking about the specific problem, or problem space, that you want to address for your community. Many times people come to me and just say, "I want to do the data thing; give me general metrics about my community," and that can honestly be a pretty frustrating place to start from, because I don't really know what to give them. But I also understand why that's the first spot: you have to figure out where you want to start and go from there. That's why I think the starter metrics model that's been set up is a really nice entryway, because data can be so overwhelming. What do you do with the millions of rows and columns and all the different points? It's a good generic starting point, but then you want to go a little deeper into what you want to learn about your community and start going down that path.

And I'll say, as a data scientist who has worked in the community-data-analysis space for over four years: once the technical architecture is there, the difficulty lives much more in the concept-building around your visualizations. Even if somebody's technical setup is much different from yours, you can really collaborate on the concept side of it and start to make things that are really interesting. That's something I really like about the CHAOSS community: there's a really large span of tools
that people use, but we can all come together to talk about what we want to look at, what data and what questions we want to ask, and that makes all of our analysis better.

So, once you have that initial question or metric, you can start converting that question into a metric. The first thing I personally like to do whenever I start asking a question is to look at the research on the established metrics already out there, and that's what CHAOSS does really well: gathering all the different metrics that people have talked about. There's a whole metrics-models working group that has produced pages and pages, and once you have your question, those pages of resources go from being overwhelming to being incredibly useful. You kind of have to start down that path yourself, because even though those pages of information can be helpful, they can also be very overwhelming, and once you have a pointed start, it becomes a lot easier to go along this path.

From there, you want to start looking at the specific data points you need. This is where your data engineering, or your technical architecture, is going to determine whether those data points are available. Have you set yourself up for the really general case, so that even if this is an analysis space you haven't looked at before, you're set up for it?
Or is this something where you need to start collecting, storing, and cleaning the data? That will be your next step. Then you move on to how you want to visualize and represent that data, and then to the potential insights and actions. And this is really just the first stop on the visualization-building process for your communities.

Once you have that first work-in-progress metric, that's when you want to go over and start getting your community's feedback. With all of these metrics and all the research from the CHAOSS community and others, you get a good starting point, but that doesn't take in the new, unique qualities of your own community. People within your community are going to have knowledge that you might not have: there might be some type of anomaly, some event you didn't know about that somebody working in documentation does know about, and they're going to flag that for you. That is the process that really starts to fine-tune the visualizations you make and makes them very useful for your community. I'd say very rarely does one exact fit, once you get to more specialized metrics, work for two different communities, especially if you're looking at communities of different sizes.

Let's see. So now that we've talked about things from a conceptual standpoint, let's look at this from a data standpoint. I'd say very rarely are you going to need a single data point or metric about your community; I've not really seen the scenario where "this is the one thing we want to know, and nothing else." And a lot of the time you'll very rarely end with the same question you started with. But even in the hypothetical scenario where you want to learn that one thing, by setting up for the more general case from a specific data source, every question you ask and every visualization you make can skip the
first two or three steps of the data science workflow. You've already established getting your data, storing it, structuring it, and cleaning it, and you can now start making your visualization and hosting it. A lot of that architecture and background is already set, if you prepared in the beginning for the most general use case.

And I'd say gearing your tooling or your setup to one specific question can really limit you and your community, and be very time-costly in the long run. I worked with a community that built a really large tech stack around one very specific question about their community, and then they came back to me wanting to do even just a minor pivot off of that, and I had to tell them that the way they had set up their architecture didn't allow for that type of pivot. They had to go and set everything up all over again and find a new solution to be able to answer the question. For example: instead of just looking at the total number of contributors, seeing how frequently different contributors are contributing. Whenever you get too honed in, and don't look at everything that's available from a data source like GitHub, it can cost you big time in the long run.

And as you go along, once you learn more about your problem space (which in this case is your community's data) and start working with the data a lot more (which in the data science world is often called EDA, exploratory data analysis), you're going to pivot more and more through that process, and you'll be able to get a visualization or metric that's so much more useful than the idea you came in with at the beginning.

So these are some of your common open source community data sources. I'm curious: does anybody in here take data from some of these sources for their communities? I'd be curious, with that, are y'all doing JSON dumps?
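The pivot described above, from "total number of contributors" to "how frequently each contributor contributes," is a good test of your setup: if you stored raw contribution events, both questions are one step away; if you stored only a precomputed total, the second question forces a rebuild. A minimal sketch, with invented event data:

```python
from collections import Counter

# Invented raw contribution events: (contributor, date) pairs, as a
# general-purpose collection pipeline might store them.
events = [
    ("alice", "2024-01-03"), ("alice", "2024-01-10"), ("alice", "2024-02-01"),
    ("bob",   "2024-01-05"),
    ("carol", "2024-01-07"), ("carol", "2024-01-21"),
]

# Question 1: total number of distinct contributors.
total_contributors = len({who for who, _ in events})

# Question 2 (the pivot): contribution frequency per contributor.
# Same raw data, no re-collection needed.
frequency = Counter(who for who, _ in events)

print(total_contributors)        # 3
print(frequency.most_common())   # [('alice', 3), ('carol', 2), ('bob', 1)]
```

Had only `total_contributors` been stored, answering the second question would mean going back to the source and collecting again.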
How do you get this data? Just pulling straight from the API? Anybody? API? API? Everyone, API, okay. Shocker. That's what this next slide is about: whenever you're looking at those common data sources, the retrieval method is almost always going to be an API, with the outputs being in JSON or CSV format. This isn't unique to community data; it's pretty common across data sources online. And in the cases where you're not getting it from an API, maybe from a direct download or a scrape, the end product is still going to be a JSON or CSV.

Beyond a one-time use case, having data in a JSON or CSV can get complicated really quickly. I have a feeling that at least one person in this room has a file on their computer titled data_dump_135, and you don't really know if that was the one from this week or last week. It just gets really complicated, really quickly, if you try to do even the same visualization twice, a month apart: you have to collect your data again, store another file, then you have another file in a folder that you might mix up, and it gets complicated, and storage-wise very dense, quickly. A full workflow cannot be based off of a fragment, and that's what a lot of those one-time use cases are.

In the data science world in general, a relational database is the tried-and-true method for storing your data for repeated use. Which database you choose, for your specific use case or preference, doesn't really matter; what matters is the relational structure of your data, and that structure being very consistent. A relational database, in our case, among many other things, allows for mapping between different contribution types and the contributors who performed them. The data is cleaned, and it can be used in a reusable format. I mean, you're able to, whenever you
have that structured data, look at issues and PRs and commits, and how the same contributors are contributing across all those different fields. If you try to do that from a JSON file, bringing all of that information in and transforming it into a similar format is something you have to do every single time, whereas having that relational format does it for you.

So whenever you're setting up your data collection for the general use case, you have all the data (in this specific case I'm going to talk a lot about GitHub), the structure is clean and consistent, and there are a lot of visualization tools that are set up to plug and play with a relational database: think Grafana or Superset. With the data in there, you can start to make those high-level visualizations right off the bat, without having to invest in deep levels of code or preprocessing to make that initial set of visualizations, and you can start to evolve your analysis much more quickly because of that ability to pivot.

Another part of this is that when the data structure is standardized, the preprocessing of the data can be reused across many different visualizations. Each time you analyze something, the next time you want to do it the exact same way, it takes little to no time investment, and if you want to do something vaguely similar, it takes a fraction of the time the first iteration did. Obviously, the more similar it is, the less time it takes; less similar, more time.

There are two different ways I like to think about this, and we're actually going to see some of these visualizations in a second. One of them is structural similarity. I like to think about it like a puzzle: the structure is all there; we just have one piece of the puzzle being swapped out, and it works the same way. We can think about
this from the time to first response for issues and PRs. We can preprocess the data the exact same way to see that first response, and whether that is issues or PRs, there's very little difference. Another example of this is looking at staleness, which is how long an issue has been open versus being closed or merged: how long has it been sitting in that pending category? Whether that is issues or PRs, there's a very minimal switch; it's really just "closed" and "merged" as an additional option, and even with that, when you look at the data, if a PR is merged it's still counted as closed. Then another way to look at this is data similarity. Many times you can take the structure of the preprocessing you've done for a similar data type, move it around a little bit, and get a completely different view on the community that you're looking at. One example of this is looking at PR review assignments per contributor, or status counts, and again, we'll look at this example in a second. The idea here is that some people want to know how many PRs are being assigned to specific contributors: are there contributors being incredibly overloaded with that responsibility versus others, and how is that responsibility distributed over time? And maybe there's a case where you just want to know, in the aggregate, is my community assigning PRs or is it not, is it assigning issues or not. In some cases that might not really apply to your community, but this is one of those data similarity standpoints. Another one is looking at fly-by contributors versus repeat contributors. You might want to look at what types of contributions are done by people who maybe only make four or five contributions and then leave the community, versus the people who are staying in that community for a consistent and long period of time.
What do their contribution activity and types look like? Even though those are looking at the inverse of each other, the data processing in the code is very similar. So how does this look in action? I'm going to show, for my specific case, the tools that we use that fit under this framework, but there are many tools that fit under it; it's really just the idea here. For our case, for the database we use a project under the CHAOSS project called Augur, and it takes the mountains of data that come from GitHub. If you've ever worked with the GitHub API, you know pain. They have done a lot of preprocessing and structuring of that data into a large Postgres database, so as a data scientist I can just go and do SQL queries and get structured data that is cleaned. I can see things across many different contribution types and see what the contributors are doing across the different repositories, and not just within repositories: I can see what a contributor's activity is among all of the repositories that I collect on. Having that structure, having the data engineering, and having validated data makes my job as a data scientist so much easier, because I can focus on my analysis, not on collecting my data and cleaning it, because that's a huge undertaking. So if we look at the high-level architecture for Augur: the Postgres database is what I touch whenever I'm making my visualizations. The Augur tool is the one that communicates with GitHub's API and does direct pulls, like doing a git clone, to get a lot of this information. They clean it, structure it, and put it in that Postgres database, and from there we have our 8Knot dashboard.
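The difference between a pile of JSON dumps and the kind of queryable relational store described above can be sketched with the Python standard library alone. To be clear, this is a toy illustration: the table layout, column names, and sample records are invented for the sketch, not Augur's actual Postgres schema.

```python
import json
import sqlite3

# Hypothetical one-off API dump (the "data_dump_135.json" situation),
# inlined here as a JSON string.
dump = json.loads("""[
  {"login": "alice", "type": "pr",    "repo": "demo/app", "created": "2024-03-01"},
  {"login": "bob",   "type": "issue", "repo": "demo/app", "created": "2024-03-02"},
  {"login": "alice", "type": "issue", "repo": "demo/lib", "created": "2024-03-05"}
]""")

# Any relational database works; the consistent structure is the point.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contributions (
    login TEXT, type TEXT, repo TEXT, created TEXT)""")
conn.executemany(
    "INSERT INTO contributions VALUES (:login, :type, :repo, :created)", dump)

# Re-asking a question a month later is now a query, not another file in a folder:
rows = conn.execute(
    "SELECT login, COUNT(*) FROM contributions GROUP BY login ORDER BY login"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```

The same `GROUP BY` can then be sliced by `type` or `repo` without re-collecting anything, which is the reuse argument made in the talk.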
So we take this structured Augur database, a relational database with the Git platform data and enforced data validation, so all of those things I can know to be true. Built on top of this is a dashboard that we based on Dash and Plotly, which are packages in Python. We really wanted to use the Python tool set because I really wanted to capitalize on the 20-plus years of data science research and backing that has gone into the packages in Python. We were able to do the preprocessing in Python and do the visualizations in Plotly, with all of the data from Augur. So it's kind of that plug and play: if you've ever seen Grafana or Superset, you can think about it the same way. We get those database credentials, we have our dashboard, and we're able to make these visualizations. Sweet, now we're going to do a little bit of a demo so you can see some of the visualizations I've been talking about. We did a workshop a couple days ago and I lost all the slides trying to switch over to the demo, so I'm no longer allowed to do that. Sweet, thank you.
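Before the demo, the kind of reusable preprocessing described earlier can be sketched in plain Python: one staleness function that serves issues and PRs alike, with "merged" folded into "closed" exactly as the talk describes. The function name and the records are invented for illustration.

```python
from datetime import date

def staleness_days(items, today):
    """Days each item has sat in the pending category.

    Works for issues and PRs identically; the only difference is that a PR
    can also be 'merged', which counts the same as 'closed'."""
    return [
        (it["id"], (today - date.fromisoformat(it["opened"])).days)
        for it in items
        if it["state"] not in ("closed", "merged")
    ]

# Fabricated sample records:
issues = [{"id": 1, "opened": "2024-01-01", "state": "open"},
          {"id": 2, "opened": "2024-02-01", "state": "closed"}]
prs    = [{"id": 7, "opened": "2024-01-15", "state": "open"},
          {"id": 8, "opened": "2024-02-10", "state": "merged"}]

today = date(2024, 3, 1)
print(staleness_days(issues, today))  # [(1, 60)]
print(staleness_days(prs, today))     # [(7, 46)]
```

That one shared function is the "structural similarity" point: the issue and PR charts differ only in which list you pass in.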
So this is the 8Knot dashboard. You will be able to see a hosted version of this online; we'll have the link in our final slide, and come and talk to us about it. This initial welcome page, if you go and look at it a little more, has a bunch of different tabs talking about how to use the dashboard, how to make user groups (which is a combination of a bunch of different repositories), how to get the database to start collecting on the repositories you care about, all that good stuff. All that information is right there for your use. But let's look at some of those visualizations that I was talking about. This is also a really good example of what Dawn was talking about earlier with the wall of metrics. There is a whole lot of metrics here, and which ones do you care about looking at? If you haven't taken the step beforehand of asking yourself what questions you want to ask, there's a whole lot of colors and bars, and they don't really mean very much until you start to get a little bit of perspective on what you want to know. So one of those examples that we were talking about, where you're using similar data but changing around the code and the structure a bit: you have this pull request review assignment, looking at the assignments by contributor versus the status count. And then the staleness of issues is very similar to the staleness of pull requests, so those were a couple of the different graphs that we were talking about. What time is it? We have a little bit of time, so I'll hop around a little bit. Another thing that people seem to really be interested in is the contributor affiliation.
So where are your contributors contributing from, if you're able to identify whether they're associated with any individual group or company? The way I think about this page is as a general question: what is the organizational diversity of my project? It would be great if you could just get a list of numbers for how many people are individual contributors and how many are from each company. We don't really have that data, but we can come at that question from a lot of different angles to start to get a holistic view of the project's makeup. An example of this is that there are two different graphs right here, and one looks at the unique emails of all the contributors. So even if I contribute a thousand times, my email counts once; if somebody contributes once, their email counts once. That gives you one view, while the commit activity by domain is going to be proportional to the amount of activity in the repository. Neither one is going to tell you the full picture on its own, but you start to get a fuller view of your community when you look at things from different angles. So this is the app. I'll pass this over to Dawn, and if people have other questions or things they want to look at at the end, we can do that as well.

Okay, cool. So I am going to show you a little bit about the Starter Project Health metrics model. Just let me reconfigure everything, and then we'll be good. Okay. Perfect. Nope, almost perfect. Hope it's this way. Okay, there we go. Now I've got the cursor on the screen that I'm looking at; these things are important. Okay. So Cali talked a lot about gathering data. You've seen her demo of 8Knot using Augur, but it's also important to do something useful with all of this data, right?
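One concrete example of doing something useful with this data is the bus (or lottery) factor that comes up next. A common way to compute it, sketched here under the assumption that it is the smallest number of contributors who together account for at least half of all contributions (tools vary on the exact threshold), with made-up counts:

```python
def bus_factor(commits_by_author, threshold=0.5):
    """Smallest number of top contributors covering `threshold` of all activity."""
    total = sum(commits_by_author.values())
    covered = 0
    for n, count in enumerate(sorted(commits_by_author.values(), reverse=True), start=1):
        covered += count
        if covered / total >= threshold:
            return n
    return len(commits_by_author)

# Hypothetical project: two people dominate, several others not far behind.
counts = {"ann": 40, "ben": 35, "cho": 15, "dee": 6, "eli": 4}
print(bus_factor(counts))  # 2 -> losing two people removes half the activity
```

A low number is the warning sign: the project depends heavily on very few people winning the lottery and leaving.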
So the data is great, but we talked earlier about the tsunami of data, and tools like Augur, 8Knot, and other visualization tools, as we've both mentioned, don't actually solve this problem for you. You still need to think about what questions you have and what you want to learn about your project, to help you focus on the data and the visualizations that you actually need to see. So in the case of open source project health metrics, I created the Starter Project Health metrics model with four very simple metrics, which I mentioned on the earlier slide about CHAOSS metrics models. If you're still figuring out which questions to ask, you can use this data as a starting point to improve your project and your community. Now, not everybody needs this. If you're already experienced in metrics and already using them, then you probably don't need these at all, but we found them to be a really helpful way in for people who are just getting started. I'll talk about a few ways to interpret this data so that you can use what you've learned to make actual, real improvements, and then hopefully what you learn from these four metrics will generate new questions and new ideas for what additional data you might need for your project. I recommend measuring contributor sustainability using bus factor, which is also called the lottery factor, because it helps you understand what might happen to a project if one or more of the key people on your project won the lottery, retired to a beach, and very suddenly left your project with little to no notice and never came back. There are a couple of things this metric can tell you. First of all, how big of an issue is your current contributor situation? If it's like this one, you really should focus on getting a few more people who can contribute and eventually be moved into leadership roles, like maintainers or reviewers, for example. Now, this graph shows the top 10 contributors to a project, and out of those
top 10, these are the percentages of contributions. You can see that there are two people who are responsible for quite a few of those contributions, but there are also quite a few other people who aren't that far behind them, so this one's looking pretty good. One reason I really like this metric is that you often find there are people contributing more than you realized, and this can help you think about who you can encourage to contribute more, and maybe find someone who could move into a leadership role. Reaching out to someone and acknowledging their work while encouraging them to do more can help quite a bit with growing your contributor base. Sometimes people just need a bit of encouragement, and you can ask them for specific things that you know they're particularly good at. There are several communities that I've personally gotten more involved in because someone came to me and asked for my help with something specific. In this case, you might look at the people making fewer commits to decide whether they're ready to become a maintainer, or, if not, maybe they're good candidates for mentorship, or for becoming reviewers with an eye toward making them maintainers after they get a little more experience within the project. Now, the catch here, and with many metrics, is that we don't just want to think about the people making commits and pull requests. It's a good start, but you should also be thinking about how you can move people into maintainership or other leadership positions, to be responsible for things that might not show up in a repository: documentation, community management, marketing, product management, all of the other things you need within your project that are really important roles for you to be successful. It's really important to keep up with requests and resolve them in a timely manner, even if the response is closing requests that aren't going to
be merged, which can be a difficult conversation. It is very easy to get behind on incoming contributions; we all get behind sometimes. But not addressing these contributions promptly creates technical debt and reduces the chances that they'll ever be merged, because those old change requests or pull requests are likely to have so many merge conflicts that they're just too difficult to accept after a certain point in time. Now, it can be tempting to solve responsiveness problems by putting extra pressure on the existing maintainers and asking them to respond more quickly and resolve more contributions, but this rarely actually solves the long-term problem. You might get some short-term gains, but it could be damaging to the community and the project over time if all you're doing is burning out your maintainers by not resolving the underlying problems that are causing the lack of responsiveness in the first place. If you see responsiveness declining, it might be time to move more contributors into leadership roles and promote some contributors to become maintainers for your project. As I mentioned on the contributor sustainability slides, you can start by looking at those contributors within your project to find people who are contributing regularly but aren't already maintainers. A good first step is to promote some of those people to reviewer roles, where they can reduce some of the load on maintainers by reviewing contributions from other community members. More people reviewing contributions should provide better responsiveness, since maintainers will have less work to do for each contribution, and they can focus on the things that require more expertise within the project. It's also important for projects to respond to requests in a timely manner, because a quick response can help you retain contributors who otherwise might become discouraged if they never get a response, or get a response
months later, after they've forgotten that they even submitted a pull request. Timely, thoughtful, and kind responses to contributors indicate that you appreciate their work. In both of these responsiveness metrics, it's really important to focus on the trends. If responsiveness is already improving, great, keep up the good work. However, if you see responsiveness declining, then it might be time to find ways to improve it, including, as I mentioned earlier, recruiting more contributors and maintainers. But talk to your maintainers and find out what else they spend their time on. If maintainers spend a lot of time on things like community management, documentation, or other tasks, then maybe recruiting people to help in those roles would be time well spent to free up time for maintainers. In other cases, improved documentation might help reduce maintainer workload. For example, if maintainers are spending a ton of time onboarding contributors or answering questions about the contribution process, maybe better onboarding docs or contributing guides would free up some of their time to focus on responding to incoming contributions. It can also help to use templates for things like issues and merge requests or pull requests, since anything you can do to help contributors make good contributions that require less work from reviewers and maintainers can really help improve responsiveness. The other point I wanted to make on this slide is that there are a lot of different ways to visualize data using the CHAOSS project's Augur tool. In this case, I wanted to focus on the trends and display some additional explanatory data above and below the chart, so I created these visualizations by writing queries against Augur's Postgres database. This is the exact same database that Cali uses in 8Knot, which I used to collect the data, and then I just used standard Python libraries like matplotlib and seaborn to create the visualization. And again, this is a custom visualization, primarily
because where an individual project publishes releases varies quite a bit depending on the type of project it is. But regardless of where the releases can be found, it is really important to look at the frequency of releases, including all of the releases, even those little teeny point releases, because it's critical that security updates and bug fixes actually land in a release in a timely manner, and it's important to get those new features out as well. An appropriate release frequency for your project is influenced by the size of your project, and maybe how many dependencies you have on other projects that are also releasing fixes that you'll need to incorporate. For projects that you contribute to, you should think about whether you're cutting releases frequently enough to keep the project up to date and secure. Now, the catch is that none of this is something you get to do once and then just forget about. Regardless of what metrics you're using, you should be thinking about how to make this a continuous improvement process. The Starter Project Health metrics might be a good start, but you'll probably want to add a few more metrics later to better understand other aspects of your project. And then, when you take action to make some improvements, you'll want to monitor those metrics over time to see if the actions are making a difference, and to see if any improvements actually persist over time. Here are some links. The top link is to the CHAOSS community, where you can find our metrics, our meetings, our software, our podcasts, our blog posts, and all kinds of information about how you can participate in the CHAOSS project, or just use the stuff that we create; that's cool too. The Metrix link (metrics with an x) is to a software-as-a-service offering hosted by the CHAOSS project, where you can use Augur and 8Knot. You can add your own repositories, as long as they're public, or view ones already added by other people. You can also have a look at the 8Knot repo if you're interested in learning more
about how they've created their visualizations, and you can contribute new ones. It's fairly straightforward to build on the existing visualizations and create new ones that are similar, ones that meet your particular needs. I've also added a link to an article called "Beyond the Repository," which is a short read, something like six pages, but it talks about some of the things you should be thinking about, like responsible use of data when gathering and analyzing data about open source projects. It's something that I recommend to basically anyone doing research or open source metrics gathering. Now, throughout this whole presentation, we've talked about how the data for open source projects can feel like a tsunami of never-ending, overwhelming amounts of data. Cali talked about how to use a data science workflow as part of your process for thinking about how you can use data to answer the questions you have about your project, along with how to gather your data and store it in a standardized data structure, like a database, for example, to give you the flexibility that you need to answer additional questions in the future, so that you can reuse what you've already gathered to produce additional visualizations from that same data. You should also think about how you can build on the work that other people have already done, like the metrics and software that we've developed for the CHAOSS project, for example. And you should work with your community to help understand what the data is telling you, so that you can build a richer interpretation that helps your project make improvements. But having the data and the visualizations is really the start of the journey.
It's not the end. You need to understand what it all means for you and for your project by coming up with actionable insights and ways that you can improve your project, as part of a cycle of monitoring the data to ensure that you're continuously improving and learning as your project evolves. With that, thank you, and we have plenty of time for questions.

We're short on microphones, so if you have a question, shout it loud, and I'll ask Cali and Dawn to please repeat it so the people streaming will be able to catch it on the mic. Okay, go ahead.

So what do we say about Goodhart's law, which is that once you make a metric public, it becomes a bad metric?

Do you want to start with this, or do you want me to? Okay. Yes, anything that you measure will likely be gamed by someone. One of the things that I've tried to do, in particular when I was at VMware: I used those four Starter Project Health metrics with all of our projects, because my thinking was that, since these metrics really are about improving your project, most of the things people could do to make those numbers go up were probably good things. So I tried to make it so that if they do try to game it to make the numbers look better, hopefully it will be done in a way that makes the project better. However, I've also seen it go terribly wrong, because at a certain company that I worked at, one of the business units decided that they were going to base OKRs, basically how they were going to measure the performance of their employees, on my metrics and their thresholds, because I had kind of a green/red scheme. And my response to that was: oh my gosh, please.
No, please don't do that, because I've set random, arbitrary thresholds that are what I think work for most of that company's projects, but there's no one-size-fits-all metric; those thresholds blatantly won't work well for certain projects. So what I talked them into was that they should make an improvement to something every single quarter, or whatever the cadence was, so that they could show that they made some improvement to their project. It could be anything, and as long as they were improving something, that was good.

Yeah, I have a couple of thoughts. I'd say the first one is that this is why I like having multiple visualizations looking at a similar space. If we look at the affiliations page, for example, that has a couple of different visualizations that are all looking at similar things. I think that helps prevent the gaming problem, because once you start to shine the light from a couple of different angles, if they're all showing a positive trend, then I kind of go back to what Dawn was saying: it probably means something positive. I would also say that once you start getting into a larger visualization portfolio, I think it's pretty hard to game. I would dare someone to try that.
I don't know if I want to dare, but gaming all of the metrics at once? I'm not sure; that'll take you a lot of time. I think the quickest way to game them would be to just make improvements in your community.

Yeah, I think everyone you ask in this space is going to say something a little bit different, so I'm kind of glad that Dawn's here as well, because I think she's going to answer this very differently than I would. The initial thing that I would say is that if you are trying to look at those repositories in aggregate, if all of them grouped together is what really matters, then you can start looking at and using 8Knot (and I'm too scared to touch it live right now), and you can put all of those repositories in the search bar; that's when you could create a user group. So you can start looking at all those metrics in an aggregate way. If instead it's all about the individual ones, I would say you would start to look at it in a more individualistic way. One thing that's also really interesting is that there's a project velocity graph we have in 8Knot where you can choose the weights for issues, PRs, and different actions, whatever matters the most to you, and you can see the velocity, which is something that you choose, of those repositories next to each other. You're looking at activity for issues and PRs, the number of contributors is the size of the dot, and I think the other axis is commits. But it's about looking at visualizations that put them all in a comparative chart; it's more a matter of whether you need to see them as individuals or whether you need to see them all together.

Yeah, and just to add to that: those starter metrics are really just designed to get you thinking about what you need. What I actually recommend that projects do, or companies if they're shepherding some open source projects, is to really take a step back and
think about what your overall organization, your project or your company, is trying to achieve, how the work that you're doing within that project fits into that broader structure, and then what metrics you need to show that you're successful against whatever those criteria were. This is what's going to be different for every single project. You can start with some things to get you thinking about what your options are, but you really should, at some point, take a step back and figure out what you need to achieve and how you can justify it. Particularly in the financial climate that we're in right now, companies are looking for things that they can cut, and you don't want to be the thing that gets cut. So if you can use your metrics to justify why the work that you're doing is important, and why you're succeeding at that work, then you'll be a lot more likely to make it through the next wave. Other questions?

So the question was: how do you measure those other things that aren't code metrics, some of the conversational things? We actually have a lot of metrics defined within the CHAOSS project that aren't based on what we refer to as trace data, the stuff that you can just get out of APIs, but are focused on things like, in particular, diversity, equity, and inclusion. A lot of that is best measured by surveys: how do people feel about the things that are happening within the community and their place in it? So surveys are certainly a good way to do that. You can also pull a lot of the trace data about conversations and do sentiment analysis and things like that. One of the things that we've done quite a bit within the Kubernetes community, for example, is try to push most things into GitHub, right?
So even things like organizing the contributor summit are based on issues and markdown files, so people are contributing to those, and then you can see some of those other contributions for things that might not necessarily always happen in the repository. It's a similar thing for documentation, and the same for product management; a lot of that's done, in the case of Kubernetes, in GitHub project boards. So you can push a lot of stuff into some of these repositories for things that you might not naturally think of in that way.

I'm looking a lot right now at response frequencies. Whenever you look at issues or PRs, what are the conversations happening around them? The first response is that initial look, seeing how quickly people are getting responses on either one, but also how the conversation is going: not just that first response, but seeing, for the issue opener or PR opener, how long in that conversation thread they are waiting for another response. And then something I want to start looking at, kind of the next stage of this, is whether there is very skewed behavior in responses for different groups of people. Say you look at the top 10 contributors: how do they converse with one another, versus how is the response to a new contributor? Or looking at different ways to break down the activity, because I think there might be some trends you'll be able to see, where some people get really fast responses and some people don't get responses at all, and starting to make it a little more granular.

Let's have a round of applause. Thank you, Cali and Dawn. Floor con and the hall con are also still going on.

Okay. One, two. One, two. Can everybody hear me? All right. Thank you. There you go. Hello, good afternoon, everyone. Can you hear me? Is everything good? All right.
Awesome. Thanks.

All right, today we're going to be presenting about how to use open source tools to serve AI and machine learning models on private infrastructure. It is going to be hands-on; some parts were recorded, but the most interesting parts are live. So, first of all, we want to introduce ourselves. Who are we? My name is Cyan, and I work with Arkus Nexus, which is a software development company. If you want to take a look, we have a QR code at the end of the presentation.

And I'm Aaron Vega. I'm also with Arkus; I've been working there for several years. I work mostly on back-end services, and for the past few years I've been specializing in cloud computing and infrastructure, so that's where a lot of this is coming from.

So, the problem that we saw, why we are doing this: a few months ago I asked a question to my friend Aaron. I am not a very technical guy, so I asked him how we could implement a private and secure infrastructure to manage a local LLM, one that doesn't require any cloud services and is all open source, using open source tools and technologies. So Aaron gave me the talk about Kubernetes and how we can use it. What I understood is that it is very complicated to actually do that, and setting up a private infrastructure in custom environments is not easy. So that's why we came up with this POC. The proof of concept's objective is to simplify the process of setting up a private infrastructure with any custom environment. For this demo, everything is running on the local machine, and the use case that we are going to show you is an image recognition app that helps us find tacos in YouTube videos. Why tacos? Because everyone likes tacos, right? And we chose the image recognition use case because it's the simplest one; it's the "hello world" of AI, as we've seen. So it helps us keep things tidy and easy to explain.
So that's the reason. All right, a little bit of a high-level project overview. We're going to be using Ultralytics YOLOv8; that's the model for the image recognition. Label Studio for preprocessing and tagging our images, JupyterHub for the fine-tuning, and Docker and Kubernetes for portability and deployment management.

So, for those who don't know what Kubernetes is: go ahead, Aaron. Yep, for sure. Kubernetes is big lately, and if you don't know how it works, you can compare it to a restaurant. Let's say we have a restaurant with multiple chefs who specialize in different kinds of cuisine, and each chef has to focus on only the ingredients and recipes for that style of cuisine. In that scenario, we would need a chef manager who is looking out for the needs of all of the other chefs, making sure they all have the appropriate ingredients for their cuisine, and also making sure, ultimately, that every dish is served according to the order that came in. So basically, that's the job of Kubernetes. In tech we don't work with food, we work with services, so Kubernetes is a manager for your services. It helps you keep track of all of your services, and if you need to scale up or down, Kubernetes can make those decisions based on rules that you configure it to respond to.

So yeah, we're going to leverage Kubernetes to make this workflow work. The YOLO workflow is pretty simple.
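The restaurant analogy can be grounded in something concrete: a minimal Kubernetes Deployment manifest plays the role of the order ticket. You declare how many copies of a service you want, and the "chef manager" keeps that many running. Everything below, the names and the image, is invented for illustration, not part of this project's actual configuration.

```yaml
# Hypothetical Deployment: ask for two replicas of an inference service,
# and Kubernetes keeps two running, replacing any that fail.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taco-detector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: taco-detector
  template:
    metadata:
      labels:
        app: taco-detector
    spec:
      containers:
        - name: detector
          image: registry.local/taco-detector:latest  # assumed image name
          ports:
            - containerPort: 8080
```

Scaling up or down is then just changing `replicas`, which is the kind of rule-driven decision described above.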
It's essentially three steps. First, preprocessing your data set; in this case the data set is a set of images of tacos. We will use Label Studio to tag those images, in order to indicate to the model which object it needs to identify. So that's the first step. In the second step, we're going to be using Jupyter notebooks to fine-tune the YOLOv8 model, basically just running the training API that Ultralytics provides against our data set. That way we can specialize our model to find tacos, right? And lastly, we're going to serve a small application where you can input a YouTube video, and it will process that video, find all the tacos in it, and ultimately mark them up in the video. Now, the tools that we're using: we mentioned some of the high-level tools, but these are the behind-the-scenes tools. We're going to be leveraging Docker for containerization and portability; all of our tools are going to be running inside containers, which is the way Kubernetes needs services packaged in order to deploy and use them. Then, for managing the infrastructure, we're going to be using cdk8s, which is similar to the CDK from AWS, I don't know if you're familiar with that, but that's basically just infrastructure as code: you can manage all of your infrastructure with code instead of using YAML CloudFormation manifests and all that. cdk8s works in a similar way, just specialized for Kubernetes manifests instead of AWS. And lastly, we're going to be using Helm.
That's just a package manager for Kubernetes. Basically, it allows you to bundle a bunch of tools into a single package that you can later install as a single unit in your cluster. So yeah, for the setup that we have here, we have already skipped ahead through some steps in the best interest of our time. Everything that you're going to be seeing here has already been streamlined to fit the time we have, but these processes take a lot of time, because a lot of these images are gigabytes in size, so downloading them and installing them into your cluster is a slow process. Keep in mind that whenever you try this, it's going to take a lot more time than we're going to take in this presentation. That's also the reason why we already set up some parts of the cluster and didn't do the whole thing live. So yeah, basically the first step for our project was setting up a cluster, and we needed a couple of tools for that, since the point of this project was running the workflow on private infrastructure, whether that's because you have privacy concerns, or because you just want a quick lab kind of setup, or something smaller; those are some of the reasons why you would need to do this. Two of the components, the load balancer and the GPU operator, are core for running on your own custom hardware. If you run this project on a cloud provider, you don't have to worry about those components, because the cloud provider already takes care of them; it's already set up, so you basically just spin up your cluster and it already has access to the GPU, if the cloud provider allows that. But we're going to be setting those up manually in our case. Now, for this demo.
I've already gone ahead and set up the GPU operator, which, as you can see, is a bunch of services. The ones we care about are this lot over here at the bottom, especially the DaemonSet. That's a DaemonSet running inside the cluster that can talk to the hardware on the computer, so we can make use of the GPU to either train or serve the image recognition model. Similarly, I've also set up a load balancer. A load balancer is pretty simple: it helps you manage traffic coming into your cluster, and it also allows Kubernetes to know when it needs to scale up or down, or allocate more resources to your services in case that's needed, right? And as I mentioned, if you're running on a cloud provider, you don't really need to do this, but if you want to make a point of running on your own hardware, you're going to need to set it up. So basically that's it, and yeah, these are some of the parts that we pre-recorded. In this video that I'm playing, I'm just setting up the same components that I just mentioned: I'm first creating a Kubernetes cluster, and then I'm going to install first the load balancer and then the GPU operator. Keep in mind, I'm actually going to skip ahead in this video; even though it's short, about two minutes long, this process takes from 15 minutes to half an hour, so be mindful of that, just because the GPU operator sometimes takes a while to spin up and recognize the hardware. But essentially we are just going to apply some of the files that we already have in the project. Yeah, actually, I missed one part, I'm sorry. All of this that I'm going to show here.
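The bootstrap the video walks through boils down to a handful of commands. Here is a minimal sketch, assuming kind for the local cluster, MetalLB as the load balancer (the talk doesn't name which one they used), and NVIDIA's GPU Operator Helm chart; cluster and release names are made up for illustration:

```shell
# Create a local development cluster with kind (Kubernetes in Docker).
kind create cluster --name ai-lab

# Install a bare-metal load balancer (MetalLB is one common choice).
helm repo add metallb https://metallb.github.io/metallb
helm install metallb metallb/metallb \
  --namespace metallb-system --create-namespace

# Install the NVIDIA GPU Operator so pods can use the local GPU.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Watch until the operator's pods report Ready (this is the 15-30 min part).
kubectl get pods -n gpu-operator --watch
```

Note that MetalLB also needs an address pool configured before services of type LoadBalancer receive an external IP, which is part of why cloud providers make this step invisible.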
It's all in a public GitHub repository that we're going to share at the end, so please keep an eye out for the link. That repository also has a README documenting the whole process of setting this up from scratch, and it includes a branch called "start from here", which you can check out so you don't have to start from scratch: you're already going to have the load balancer and the GPU operator, and you can use that branch to bootstrap your own lab. So yeah, that is basically just running a bunch of commands. There are some workarounds as well, because we are using kind. kind is a development cluster for Kubernetes, and it has some issues with hardware, so we had to find a workaround for that. That's the Docker modification that we're doing here: basically, we're just creating a symlink inside some of the containers in the cluster in order for them to be able to see the hardware. Sorry, okay. Yeah, actually, no, that's in the other step. Okay, so the next step: once we have that, it's basically a bare-bones Kubernetes cluster; what I just showed you was just setting up the cluster. Now we need to add the tools that we're going to need for our workflow, and we do that using Helm charts. As I mentioned earlier, Helm is, not a simple, a package manager for Kubernetes that allows you to bundle up different tools into a single unit. The two tools that we're going to be installing are Label Studio and JupyterHub. I'm going to show you some code now; I didn't open the code yet, so... Oh, sorry, is that better? Yep, like that. Okay.
So, yeah, the first thing you're going to notice if you follow the README is that once you bootstrap your project, it already has an infrastructure-as-code folder with some files. These are basically the files that cdk8s generates by default when you start a project; I just added a couple more files to handle the infrastructure I was just talking about, the load balancer and the GPU operator. Now, in order to let us create things, cdk8s has the concept of charts, which are similar to Helm charts but not quite the same thing. With cdk8s charts, you can define your own constructs that are going to be applied to your cluster afterwards. So basically, you code all of your infrastructure, you run the synthesize command, and that generates a bunch of YAML files that you then apply to your Kubernetes cluster. The advantage of this is that it's way easier to manage, it's way less code, and effectively you can also reduce the need for a dedicated DevOps engineer, because it allows developers to step into the ops realm: it makes things easier for them to manage, and they're already familiar with code. So, the way cdk8s works is that we need to create one chart, or multiple charts, with the resources that we want to install into our cluster. What I'm defining here is an AI chart, to which I have added two Helm charts, Label Studio and JupyterHub, that I want to install into my cluster. All of these are documented on their respective home pages or documentation pages. JupyterHub has a Helm chart, and it provides all the parameters that you can pass into the helm command, which are the same ones that you can tweak here under the props for each of these Helm charts. So how does that look?
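The chart being described might look roughly like this in cdk8s's Python flavor. This is a sketch, not their actual code: the talk doesn't say which cdk8s language binding they used, and the chart names and repo URLs here come from the upstream projects' published Helm charts, not from their repository.

```python
from constructs import Construct
from cdk8s import App, Chart, Helm

class AiChart(Chart):
    """One cdk8s chart bundling the two Helm charts the workflow needs."""
    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)
        # Label Studio, from its published Helm repository.
        Helm(self, "label-studio",
             chart="label-studio",
             repo="https://charts.heartex.com",
             release_name="label-studio")
        # JupyterHub, likewise from the upstream chart.
        Helm(self, "jupyterhub",
             chart="jupyterhub",
             repo="https://hub.jupyter.org/helm-chart/",
             release_name="jupyterhub")

app = App()
AiChart(app, "ai")
app.synth()  # writes YAML manifests into dist/
```

Running `cdk8s synth` and then `kubectl apply -f dist/` mirrors the two commands shown in the demo; note that the `Helm` construct shells out to the `helm` binary at synth time, so it has to be installed locally.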
Once we have... sorry, there we go. Once we have our code ready, the only thing we need to do is run the synthesize command; that's the thing that generates the YAML files that Kubernetes needs to know what to run. So basically we have three files here: two are from the previous infrastructure components I was mentioning, and the last one is the one that I just compiled, which contains Label Studio and JupyterHub. Once it creates those, you saw that we also ran the kubectl apply command, which is the command that actually installs them into your Kubernetes cluster. So if we want to see the state of the cluster... oh, sorry. So yeah, basically this namespace contains all of the tools that I installed through that particular chart I was showing, and the outcome is that it spins up the services for both JupyterHub and Label Studio. It takes a while to become completely green, so be mindful of that. But once that's done and you can see that everything is running fine in your cluster, you can go ahead... actually, no, I almost missed that. You can go ahead and check, under Services, that both of these Helm charts create a service that exposes the application to traffic outside the cluster, and you can check those in here. So for Label Studio we have this external IP... oh, where is it... here it is. This external IP, which, if I go into the browser, open a new tab, and go to that address, you can see that we have our Label Studio instance running, right? And the first time you come into Label Studio, it's going to ask you to create a profile. Since this is a local setup, for the moment it doesn't really matter what you input here.
It doesn't really matter what you input here But you're going to need to create a a new Oh, no, it's already created Yeah, yeah Come on There we go. So yeah once once in label studio what we want to do is create a new project Uh, which which is the the construct that we that label studio uses for for us to upload our data set So in this case, I'm just gonna do some Name it tacos. That's what we're going to be doing And and here once you have your project set up you can import your your data set You upload your files. I already have it ready here So you just How many images? Yeah, that's a small data set. That's not Yeah, that's around 40 pictures. So yeah, but the real data set that we train the model with That's about a thousand images so yeah So before you start tagging your your images, it's going to ask you to set up your project, which is which is obvious So since we're doing Yolo, we we want I mean we can set we can choose from several of the other Template there, but the the one that I'm using is is that simple square text. So it already comes with a template. I'll just need to to Replace the provider labels Create a new one you can create as many as you want for for the demo I'm just going to create a single one. So so I can We can see the proof of concept So once you save that, uh, did it save? 
Yeah, I think so. You can go back into your data set and start processing. We already see that we have our "taco" label down here, which is what we want to use to tag the images. But in this first image we don't have any taco, so we're going to just skip it and go to the second one, select the label, and then we can just, you know, tag it like that and submit the frame. And, you know, you get the idea: you have to do this a bunch of times, until you no longer wish to work on what you're working on. So yeah, once you have done all of your... oh, sorry, that's not it... all of your data set, you can export it. The reason we chose Label Studio for this demo is that it allows us to export in multiple formats, especially the YOLO format, which is what we needed to make it work with YOLOv8. So you can export that, and it should download a zip with all of your images, all of your label data, and all of your tag data as well: the coordinates at which the images are tagged. Now, once we're done with that, we can jump into JupyterHub and fine-tune our model, so let me show you how we do that too. However, this is also a very time-consuming process, so we already ran it, like two hours ago. Basically, each Jupyter notebook spins up as an isolated pod inside your Kubernetes cluster, so you can install all of your dependencies in there. You can have multiple users, each with pods of their own, or have multiple users access the same pod so they can work in a collaborative manner, and all that kind of stuff.
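Label Studio's YOLO export does the coordinate conversion for you, but it helps to know what the exported files hold: one text line per box, with a class index followed by the box center and size, all normalized to the image. A minimal sketch of that conversion (the function name and the rounding are ours, not Label Studio's; Label Studio stores rectangles as a top-left corner plus width/height, in percent):

```python
def ls_rect_to_yolo(x_pct, y_pct, w_pct, h_pct, class_id=0):
    """Convert a Label Studio rectangle (top-left corner plus size,
    as percentages of the image) into a YOLO annotation line:
    'class cx cy w h', with everything normalized to the 0..1 range."""
    cx = (x_pct + w_pct / 2) / 100.0  # box center, x
    cy = (y_pct + h_pct / 2) / 100.0  # box center, y
    return f"{class_id} {cx:.6f} {cy:.6f} {w_pct / 100.0:.6f} {h_pct / 100.0:.6f}"

# A box covering the middle half of the image in both axes:
print(ls_rect_to_yolo(25, 25, 50, 50))
# → 0 0.500000 0.500000 0.500000 0.500000
```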
So yeah, basically what we did here was just install the ultralytics dependency, and once that was installed, we just ran three epochs, just as a proof of concept, but we are running this in our Kubernetes cluster, in our local environment. Sorry. Okay, so that's the training part of the workflow. But once we have all that done, we also want to serve our model, or make it available to some application or service or whatever, right? And the process for doing that with Kubernetes is a little bit... it's straightforward, but that doesn't mean it's easy. All you need to do is create a Docker image and push it to a registry, whether private or public, that's up to you; Kubernetes needs those images in order to pull them and spin up new pods according to your specifications, right? So, our application is basically just a wrapper around... let me just... it's basically just a wrapper around the YOLOv8 API. YOLOv8 provides a pretty neat API to either keep training your model or just, you know, predict or infer things from video or images. All I had to do was wrap that in an application that receives requests, either to start a session or to start predicting. And all of that is contained in a Docker image, whose definition you can see here. We're just building the application, which is fairly simple; you can look it up in the repository. But the important part here is that we need to use the ultralytics base image, because it contains the tools to interface with the YOLOv8 API. You can do that with Python, which is easier; we kind of experimented here and wrote a Go application that interfaces with it. It's kind of gimmicky, but it proves the point.
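The notebook step described here is only a few lines with the Ultralytics Python API. A sketch, assuming a dataset config file, here called `tacos.yaml` (our name, not theirs), that points at the exported images and labels; running it needs the `ultralytics` package and downloads the pretrained weights on first use:

```python
from ultralytics import YOLO

# Start from the pretrained "nano" checkpoint and fine-tune on our data.
model = YOLO("yolov8n.pt")

# Three epochs, as in the demo; real runs would use many more.
model.train(data="tacos.yaml", epochs=3, imgsz=640)

# Training leaves 'last.pt' and 'best.pt' checkpoints under runs/detect/.
```

The same step can be done from the YOLO CLI (`yolo detect train data=tacos.yaml model=yolov8n.pt epochs=3`), which is what their Go wrapper shells out to, as mentioned later in the Q&A.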
So that's the reason why we wanted to do it that way. So yeah, basically you just need your Docker image; once you have it, you can build it and push it to your registry. The way we're going to use it in our project is to add it to the same chart; you could add it to a different chart, but I'm going to use the same one. The first thing we want to do is set up the registry secrets. These are Kubernetes secrets that we need to create in order to give Kubernetes the credentials for the Docker registry that we're going to be using; otherwise, it's not going to be able to download those images. So we did that using basically the same environment variables that docker login persists on your system, so if you've done a docker login on your machine, you can use this code as-is and it should work for you as well. But we don't only need those secrets; we also need to tell Kubernetes what to install. So the first component that we need to specify is a Deployment. A Kubernetes Deployment is basically... I don't think I... yeah. A Kubernetes Deployment is basically you setting up a bunch of specifications about your app: which image you want to run, the environment variables, and, you know, scaling rules, all that kind of stuff. You can set that up in your Deployment; you have a pod spec here which you can modify the same way that you would in a YAML file, but you can do it directly in the code. So the Deployment is the application itself.
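Their code builds this secret out of the docker login environment; the equivalent one-off command looks like the following (the secret name `regcred` and the variable names are placeholders for illustration, not from their repo):

```shell
kubectl create secret docker-registry regcred \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username="$DOCKER_USER" \
  --docker-password="$DOCKER_PASS"
```

The Deployment's pod spec then references the secret under `imagePullSecrets`, so the kubelet can authenticate against the registry when pulling the image.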
It contains the image I was just showing you the Dockerfile for. But as I mentioned, we also need a Service for the application to be reachable from outside the cluster, so I'm also adding a new Service. What's important here is that we need to tell the Service which port to redirect traffic to. Basically, I'm using port 8080 in my application, so I need to set the target port on my Service to be the same as in my application. Otherwise, it's not going to throw any errors, but you're going to be scratching your head, you know, wondering why it doesn't work. So yeah, again, in the README I go almost line by line explaining what each of these lines does, so if you have doubts, we have questions at the end, but you can also take a look at the repository. Right now I'm just going to compile it again, let's see, and I'm going to apply again, but only the last file, the one that I modified, the one for the AI infrastructure. I didn't modify the other ones, but it doesn't really matter: as you can see right now, when I'm executing this, Kubernetes is capable of identifying whether any resources have changed relative to your cluster, and if nothing has changed, it won't apply anything else; it will just roll out the new changes to your chart. So yeah, what that did is that we should see a new pod in our cluster. Here it is, started like 20 seconds ago; this is the container for our demo application. So, for our famed taco finder, I'm just going to pass in a YouTube video. I have the logs here, and it has already started running, but you could also monitor that directly on your pod by looking at the logs.
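Synthesized out of the infrastructure code, the Deployment/Service pair being described comes down to something like the following manifests (names and image are placeholders, not from their repo). The detail worth staring at is that the Service's `targetPort` must equal the container's port:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taco-finder
spec:
  replicas: 1
  selector:
    matchLabels: {app: taco-finder}
  template:
    metadata:
      labels: {app: taco-finder}
    spec:
      imagePullSecrets:
        - name: regcred            # registry credentials from earlier
      containers:
        - name: app
          image: example/taco-finder:latest
          ports:
            - containerPort: 8080  # the app listens here
---
apiVersion: v1
kind: Service
metadata:
  name: taco-finder
spec:
  type: LoadBalancer
  selector: {app: taco-finder}
  ports:
    - port: 80
      targetPort: 8080             # must match containerPort above
```

If `targetPort` and `containerPort` disagree, everything deploys cleanly but requests silently go nowhere, which is exactly the head-scratching failure mode mentioned above.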
I don't have the Lens metrics installed in my cluster, so I'm not able to see them there, but you can use kubectl as well to see that. So yeah, if we check the logs, it hasn't detected anything yet. What this is doing is: first it downloaded the video from YouTube, and that gets passed into the Ultralytics API. Internally, Ultralytics splits the video into frames and analyzes each frame individually to find the object of interest. So yeah, we have some hits here. This is also something that takes a little bit of time; it can be improved a lot if you pre-process your video. For instance, instead of passing the video straight from YouTube, if we do some pre-processing and downsample it to, say, 20 frames per second, the prediction time gets quicker. But for now it's going to take about three or four minutes, and yeah, you can see that it's running... where is it... yeah, it's running on the GPU directly, and it's running from our Kubernetes cluster. So yeah, basically, that's it; I don't know if we want to add something else. Let's just give it some minutes to finish, and then we can check out the result. While we are waiting, do you guys have any questions or something? Okay, gotcha. I'm more curious where you found all of these; you strung together a lot of different pieces. Was it just trial and error, or were there certain resources you reached out to, to find all of these tools? And, you know, you showed the contents of some of these files you applied to the cluster, but yeah... Yeah, yeah. There's a repository for Helm charts which you can look up, but I was already familiar with both JupyterHub and Label Studio, so I was aware that they had Helm charts provided for Kubernetes.
So I would say it's a mix of experience and research, because once we settled on the proof of concept that we wanted to make, we also had to do some research to see which model fit the project best, and whether the tools we had already selected were the right ones. The Kubernetes part, that's experience: I've been working with infrastructure for over five years now, mostly doing Kubernetes. Actually, yeah, the thing that we didn't mention is that Cyan came to me because he's working with a customer that does AI stuff, actually a kind of personal assistant; they have a booth around here. He wanted to simplify things: they have some concerns, they need to run on custom hardware, or a more local setup. And we started looking from there at how we could accommodate that and make a proof of concept, researching the tools, researching the methodology. As I mentioned, it was straightforward, but it wasn't easy, because there were so many workarounds that we needed to tease out. But yeah, we did a bunch of research at the beginning, we had a lot of experience, but ultimately we also had to trial-and-error a bunch of stuff, so I would say it's a combination of all of those. Is the model selection mostly just, you have some pictures and you run through this process and then see if it spits out what you expect, or...? Oh yeah, you mean the fine-tuning process, right? So yeah, basically, this is an iterative process, so you run your training several times, and the output from YOLO is these weight checkpoints: it gives you the last iteration that you ran, and also the best one, the one with the best results. So you can pick up the one with the best results. And how do you know which one has the best results, right?
So, how do we know when to throw out some results from the training or not? You have to monitor the... I believe this is the mean average precision, mAP, I think that's the name of this stat here. It should increase over time, until you reach your desired accuracy. For this demo we just settled for around 75 to 80 percent accuracy, so we trained our model until this value here came close to that. I don't know if that answers your question. Yeah? In the back? Yeah, well, I know that's kind of a quirky choice that we made, because we just wanted to experiment. Actually, the only Python that we use is this script here, which is just telling YOLOv8 that we need to train again, so that's it. But our Go application actually uses the CLI, the YOLOv8 CLI. YOLOv8 also has a CLI, which you can use on your local system if you don't want to set up this kind of stuff. So let me see if this... yeah, so here's the result, right? It's not perfect; as I mentioned, we were aiming at around 75 to 80 percent accuracy. It gets confused sometimes, but overall it works. And by the way, shout-out to the restaurant in the video; they're not a sponsor in any way, but I hope they're doing great. So, do we have more questions? Yeah? Yeah, basically, that's... I believe, I'm not 100 percent sure, but that's a partnership kind of thing. Ultralytics provides that part; they handle it internally, so we just pass the link and they download the video on their own. So, I mean, yep, that's like an API, right? Sorry, yeah. So the question was how we get YouTube to let us download videos without them being full of ads, given that doing it manually is annoying and the endpoint changes.
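The mAP statistic mentioned here is built on intersection-over-union: a prediction only counts as a hit if its box overlaps a ground-truth box by more than some IoU threshold, and mAP then averages precision over thresholds and classes. A minimal IoU sketch, using `(x1, y1, x2, y2)` corner tuples as our own convention:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes,
    each given as (x1, y1, x2, y2). Returns a value in [0, 1]."""
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # → 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # → 0.0 (no overlap)
```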
Yeah, and also, I think they limit the rate. And the answer was that the YOLO tooling we're using seems to handle that; they seem to use some sort of API, I think it's this one here. Do we have other questions? Is it question time? Yeah, we're at the question period; we saw the tacos. Thank you, thank you for your talk. cdk8s, that's not a substitute for kubectl at the command line, because I saw you using it for setting things up. So you still use kubectl to communicate, and it's just adjusting... you still use kubectl and YAML files. I thought you said before it was a substitute for those, but I guess you still use them; it's controlling the Helm charts, and it's controlling the back end. So, okay, yeah, do I have that right? Yeah, yeah, exactly. So the question was whether kubectl is still relevant while using cdk8s. And yeah, it is, because the output of cdk8s is still YAML files that you need to apply to your cluster. But if you use cdk8s alongside the CDK, the AWS CDK, those two are interoperable: you can invoke cdk8s charts from CDK and they will execute, so you don't need kubectl in that scenario. But for the rest of the world, we just have to deal with it. We had some more questions? Yeah, I was just curious what GPU you have in there. The GPU, that's a 30-series, an RTX 3070. Oh yeah, actually, yeah... yeah, we were looking into that, but we couldn't get one in time. Do you have, you said, like a walkthrough tutorial or something? Yeah, yeah. Oh, actually, I haven't shown you the last slide, sorry for this. So yeah, you can check out the repository over there, and, you know, give us a follow at Arkus Nexus. There we go; now we've got a lot of phones up taking pictures. Any other questions? Up over here.
Yeah. Yes. Yep. So my question is, have you actually considered using operators to automate some of the things that you did today? You could just run them in the cluster and just feed the models into them and get the results. Run operators? Kubernetes operators. Oh, no, actually, I haven't looked into that, but yeah, that would make it faster; you can run them natively. Yeah. Over here in the front. I'm taking the long way around, getting my steps in. So it looks like the GPU operator is doing a bunch of the heavy lifting for the NVIDIA setup, right? Yep. And you're on a single-node box, because you're just running on your workstation? Yeah, yeah. Does it auto-detect which nodes have NVIDIA hardware? Yep. And do you do something in your deployment to say which nodes to schedule to? Yep, that's a good point. I can show you right over here: we have some files for the cluster definition. This is specific to kind, but you would have to do something similar for either bare-bones Kubernetes or any other cluster provider. We are telling it here to create those files, mounting them, rather, into the containers, so those are the ones that allow the nodes to talk to the GPU. All right, big round of applause for Aaron and Cyan. Thank you. I'm sure they'll be happy to talk to you for a few minutes at the front or in the hall. Yep, thank you very much, everyone. On the way to dinner... no, no, so let me try this one here. Can you hear me? Can you hear me? Is the sound... yeah? Good. Perfect. Yeah. Any SCALE first-timers in here? One, two... four, some more that I couldn't count fast enough. What about SoCal locals? Who's traveled the farthest to be here? We've got a Texas... what? France, Germany.
I think that beats Texas. Sorry, Texas. I have a cousin who's French; my grandparents met during World War II and my grandmother came back from France. I sent a letter to a 20-year-old address in France, got an answer back, and went and visited my family in Veer St. Paul. Anybody coming to game night? Two-minute warning. Okay, it's five o'clock on Saturday in Pasadena. This is "Optimal Approaches for Real-Time Machine Learning". A warm welcome, please, for Haisham Kinesh. Thank you. Switching microphones, I think... can you hear me at the back? Perfect. Okay, yeah, so thank you for joining this talk, Saturday, 5 p.m. It's a first in my life: I have been around for a long time, doing a lot of conferences, and it's the first time I'm giving a talk at 5 p.m., especially here in California. Anyway, my name is Haisham, from France, as they said, and I want to talk about... so the original title, sorry, was not all these shiny technical words; it was "How to Treat Coronavirus Fatigue and Burnout by Cycling to the Office", and you will see how things lead us to that. So, I mean, the official topic is how to use Apache Spark on Kubernetes. Apache is another foundation: we have a lot of things happening here around Linux and the Linux Foundation, but the other foundation that manages a lot of software is the Apache one. So: how to run Spark on Kubernetes, specifically for machine learning, and when you want to run real-time machine learning. Okay, so a few words about me. I'm an open source project architect with Instaclustr. We have a booth; please come visit us if you want to learn more about Instaclustr. We are part of NetApp. In previous jobs, in my previous lives, I have done a lot of data engineering and a lot of machine learning, but from the platform and tooling side of things.
I'm not a data science expert; there are a few data science experts in the room, so please don't challenge me about models and stuff like that. I do understand it, but that's not my main topic; my main topic is how to put all this stuff into production and build tooling and platforms around it. Okay, just a quick disclaimer here. It's important to reiterate: there is no contribution to any open source project here. This is a project that we have been doing internally and with a broader community, so there is no... hopefully we will get there, maybe, and contribute to some of the open source projects we're using. Yeah, my vision of things is necessarily biased: I'm talking about these two technologies, and, by the way, I work for a company that offers managed services for Spark and Kubernetes. So I'll be biased; challenge me, please, if you think something is too biased. And the main thing is that this work is based on the principles of open source: open source software, open data (we use a lot of open data), and a culture of knowledge sharing, whether with researchers, engineers, or volunteers of any kind. And yeah, human learning is a lifelong work in progress; that's my slogan. Everything we do can evolve, and we keep learning. The agenda for today is pretty packed, so I'm trying to go as fast as possible to cover everything, and then again, I'll be around if you want to ask any questions. There will be no demos; the last one was a fiasco, so I'll put everything out there if you want to have a look at the stuff we're doing. We've started just putting out notebooks and putting things together.
I haven't named the team, but there are people from Australia, a lot of people in France, and guys and girls in the US working on this. For now we're waiting until we get everything together, so I didn't list them — but it takes a village to build what we're trying to build here. Okay, so what is the story? After COVID burnout, I decided to cycle — and only cycle — to the office. It took me about an hour to cross Paris east to west to get to the NetApp office, and I started discovering all this stuff. OpenStreetMap is really amazing, and there is a vibrant community of cyclists behind a lot of what's happening there. But cycling was new to me, and I struggled a bit getting to work — having a consistent ETA, a consistent travel time. So first things first, I wanted to avoid traffic. A couple of us were cycling to the office together, but all the routing information is static: it just takes you from point A to point B and gives you two or three routes or paths — not recommendations. So how can I get there while avoiding traffic? Traffic affects cyclists, and obviously pedestrians too. I went to this app that we all use — I don't want to name it — thinking maybe if I use it I can avoid traffic, and I ended up in parking lots. There were a lot of accidents and closed roads. So the guess was: maybe it's just not the right app, we can try others; or maybe Paris is the problem — it's not a city known for cycling. Trivia question: what is the capital of cycling in Europe? Yeah, Amsterdam, great.
So I said, maybe I can check what's happening in Amsterdam. But the result was the same, and those guys — and girls — had exactly the same question: what's wrong with the algorithm? Why can't it show something smarter than this? And if you look here — you all understand French, yes? You guessed already. No, I'll help you: none of the paths or recommendations are based on real-time, real conditions. They don't take anything live into account. And I'm not bashing this app — it's actually the best you can have so far if you want to go from one point to another in a city. It has really great features and gives you a lot of information. But as a data guy, I was expecting this kind of thing to be an easily solvable problem. Okay, let's push things a little further. Coming out of COVID, we all have this awareness around being healthy, around avoiding disease. So why cycle, why make the effort, if it's actually bad for you? Think about the higher concentrations of pollution that arise near road networks. So we started thinking: how can I get there — how can we get there — on the route with the least air-pollution exposure and the least traffic? Let's be ambitious — maybe it's a utopia — and have everything in one place that scores routes on different criteria and gives us the best route from one point to another. Unfortunately, same problem.
There was nothing out there. There was some work, mainly in academia, that started mostly in Europe; then a lot of work happening in Asia — air pollution is a really big issue in India and China — so at least people were starting to think and share, with a lot of research articles on how to adjust, how to predict, using machine learning — graph neural network approaches, at that point — to predict this kind of thing. And you can see where I'm going: you can only do this if you have real-time capabilities. Everything that happens where you are traveling — where you are cycling, where you are walking to work — is going to affect your route in terms of pollution and exposure to pollutants. We also checked whether something equivalent existed in other areas, but we didn't find anything relevant to reuse, at least as open source. Okay, one definition, at least in our context. When we say real time, we mean near real time. We can discuss it, but we care more about getting the right information than about the speed of getting it, so seconds are tolerable — we are not going sub-second. That's the target for "real time." And real-time machine learning, in this case, is just the application of the usual machine learning models and approaches to make decisions in real time, based on a changing environment and on the information you gather: there is an accident somewhere; there is, I don't know, a peak in pollution because of traffic.
There is an event — there is a game tonight in LA, by the way, and it will probably have a lot of effects. We want to capture all of this, get it all in one place, to produce the most relevant prediction. Okay. So, from thinking and whiteboarding to building something. Here is the open source stack — again, you can challenge these choices; there are probably ten others you could make, and they would be legitimate. We started with these building blocks because they were common, there was a lot of community help, and we had the skill set in the company — we have contributors to these open source projects, so we can tweak and twist the knobs when we want to tackle challenging stuff. As for the building blocks: everything runs on Kubernetes — there are some exceptions generally, but in this case everything runs on Kubernetes — and the key piece, Spark and Spark Structured Streaming, is the main building block of our processing and our training, the whole brain of the platform. You'll also see things that, honestly, I learned on the fly — like the feature store. I hadn't worked in that space, but an online feature store is pretty amazing: it acts as a bridge between data and machine learning. There were a lot of key things I learned, but let's focus here on the main thing, which is the infrastructure side. Okay, so what are the biggest challenges we faced — that data teams in general face — when they try to transition to, and tackle, real-time machine learning? I had a lot of inspiration from interesting discussions with Chip. She has been doing an amazing job in this field. She has an amazing
website that you can read. She did a lot of interviews — she went to a lot of tech companies in the US and China and collected a lot of information about the challenges of dealing with real-time machine learning. She also has — of course I'll do some publicity for her — an amazing book, if you want to learn more about designing machine learning systems, with a whole chapter on real-time machine learning. Another mention, too, for Practical MLOps, and I think there's another book coming. So we faced two different kinds of challenges. One comes from the machine learning — let's say the data science — side of things: modeling, how to extract features, what cutting-edge model to use — attention, graph neural networks, and so on — to train models on geospatial time series, and a lot more. Those are, let's say, the blue items. The other side is the infrastructure, or tooling, side. And I cannot agree more with Chip and other experts in the field: the real challenges are on the infrastructure side. I'm not minimizing the work happening on the data science side, of course, but if you look at it, the problem is over here: we don't have any blueprint, we don't have people sharing knowledge on how to stitch things together, how to build things together. It's still new to have something common that is adopted and stress-tested in the community. So those are the challenges, and among them — since we don't have enough time — we focused on scalability. This was key: long-running — I mean never-ending — jobs collecting sensor information forever. You have peaks, and whenever we talk scalability we also have to think about cost.
So we have to scale down — reduce the workers, reduce the infrastructure — as soon as possible. Latency, of course, is key. When I mention latency in this talk, I mean mainly processing latency; I won't cover serving latency, which is another challenging area. Monitoring, again, is key — I'll try to explain why — and then there's distributed training and inference, which, as I just said, I'm going to mostly avoid in this talk, though we'll touch on it. Okay. In real time, the most critical aspects are the freshness of the features — the feature data — the serving latency, as we said, and the accessibility of the data, with these pipelines always running in a robust, reliable way. That's the key definition, the building block, we want to keep in mind while building this work. So, Spark and Kubernetes for the streaming side: we could manage stream processing, as I said, plus training, scalability, and resource efficiency — which was key, especially once we started dealing with deep learning and GPUs and started burning a lot of money. Different companies I've worked for will tell you: be careful with your GPU bill at the end of the month. Okay — it's customary to ask a question just to check that you are still alive, and it seems like you're doing okay, but it's really legit.
I want to ask: who here is using Spark on Kubernetes? Okay, no one. That's a good transition. In that case I'll go back to the genesis — not of Spark, but of why we moved Spark from the Hadoop ecosystem, or at least from YARN, to Kubernetes. Apache Spark is the number-one analytics engine for big data and AI — that's as objective as I can be. The reasons, its two green stars: it's versatile. It can do ETL, ELT, real-time, machine learning, BI. That's really key — we don't want to choose a tool that does only real-time; you can see where I'm going with this. And it's easy — this one is subjective, but if you bring in someone from a Java, Python, or SQL background, they can interact with Spark. The distributed aspect of all this big data is hidden under the hood; you don't have to think about how your workloads are distributed, how even your Python code is distributed. You just write code, and it manages the distribution for you. And it's fast — or at least it was fast. That has started to become a gray area, and we can discuss it, but it was fast until recent frameworks arrived for massively parallelizable reads, writes, and processing. Okay, so this is Spark.
It was born in that ecosystem, that era, of Hadoop. Hadoop is still around — we can't quite say "the king is dead, long live the king." Kubernetes is the new framework, but as I keep saying, Hadoop will stick around. Spark can do all this processing, but it needs what we call a cluster manager. The cluster manager does all the orchestration of the jobs — in a nutshell, it knows where to place a processing unit, on which worker, where a worker is just a VM or a physical machine, to simplify things. Spark went through different resource managers, all from the Hadoop era. There was the built-in standalone functionality — really limited; there was Mesos, which is gone already; and YARN, which is still the most widely used. And Kubernetes is the most popular — the new cool kid on the block. We're really happy this became a reality for Spark users. It took an amazing team — I mean a community — of people to do it, and I like to say that the Apache Foundation side and the Linux Foundation / CNCF side worked together to make it a reality. So this is the history of Spark on Kubernetes: it's fairly recent — it went GA in 2021 — so there is still a lot of work to do, and as of the latest version there are still major enhancements landing in Spark on Kubernetes. To give you a sense of the limitations of Spark on YARN, the key problem was the single global Spark version. Now, retrospectively, we know how it's doing.
It's definitely better on Kubernetes. On YARN you had isolation problems, limited — I put a star there, but it really was limited — Docker image support, and a lot of resource overhead; this slide just gives you an idea of how YARN manages a cluster for Spark. On Kubernetes, of course, you get native dockerization and simpler dependency management — which is really great when we train machine learning models — reliable execution, and isolation through containers. And you get the benefit of a standard, cloud-agnostic infrastructure — that's a really important point, not to be missed. One of the key things I've noticed is the rich Kubernetes ecosystem that Spark users like us can take advantage of. That's a really cool thing. There are two ways to run Spark on Kubernetes: the old-fashioned spark-submit, if you're familiar with it, and — we are lucky — the Spark operator for Kubernetes. I'm really proud that the company I belong to contributes to and helps maintain this operator. It was open-sourced by Google, but you can use it on any platform — it's really open source — and it's YAML-style config with sugar on top: config maps, volumes, affinities, and so on. We also have a concept we created locally, in our world, called VNGs — virtual node groups. So that's the Spark operator, and as I said, it's a really active project with a lot of support, and I insist on this.
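To give a sense of what the operator consumes, here is a minimal sketch of a SparkApplication manifest, built as a plain Python dict (the operator takes the same structure as YAML). The job name, image, entry point, and resource numbers are hypothetical placeholders, not a recommendation from the talk.

```python
# Sketch: a minimal SparkApplication manifest for the Spark operator
# on Kubernetes (v1beta2 CRD). All concrete values are hypothetical.

def spark_application(name, image, main_file, executors=2):
    """Return a minimal SparkApplication manifest as a dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "spark-jobs"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 2, "memory": "4g"},
        },
    }

manifest = spark_application(
    "air-quality-stream",                # hypothetical job name
    "example.registry/spark-ml:latest",  # hypothetical image
    "local:///opt/app/stream_job.py",    # hypothetical entry point
    executors=4,
)
```

Serialized to YAML and applied with `kubectl`, a manifest like this is what replaces a long spark-submit command line.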
It's really important: if you are planning your journey — if at some point you want to test Spark on Kubernetes — please don't use spark-submit. You will hate it, because it's bad. Use the Spark operator and you will adopt it. And I'm asking you: if you're starting this journey, if you're considering moving to Spark — especially Spark on Kubernetes — and you're already running Kubernetes, it's easy to deploy a Spark cluster and start testing. This is not just an engineering migration to the next tool; it has some distinct advantages, as I was showing you. And look at this: this was from AWS — this small cost reduction, for some of our customers, represents millions of dollars per year. It's really huge, in terms of cost, in terms of management, in terms of the army of engineers you need to maintain something like this. So it's justified, and the whole industry is going this way. Okay, some challenges of running Spark on Kubernetes, and our solutions. As we said, we'll talk about monitoring, scalability, latency, and model training. This is the iceberg — don't be scared, there's no Titanic in this story. We are definitely in a situation where there are logs, logs, logs — everywhere.
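On the logs-and-metrics sprawl: Spark (3.0+) ships Prometheus support out of the box, so one low-effort starting point is just enabling its built-in endpoints. A hedged sketch of the configuration keys involved — the Prometheus scrape side is assumed to be configured separately:

```python
# Sketch: Spark conf entries that expose driver/executor metrics in
# Prometheus format using Spark's built-in support (Spark 3.0+),
# so no extra metrics agents are needed on the pods.

def prometheus_metrics_conf():
    """Spark conf enabling the built-in Prometheus endpoints."""
    return {
        # Executor metrics, exposed on the driver UI in Prometheus format
        "spark.ui.prometheus.enabled": "true",
        # PrometheusServlet sink for the metrics system
        "spark.metrics.conf.*.sink.prometheusServlet.class":
            "org.apache.spark.metrics.sink.PrometheusServlet",
        "spark.metrics.conf.*.sink.prometheusServlet.path":
            "/metrics/prometheus",
    }

conf = prometheus_metrics_conf()
```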
So I would recommend that you start thinking about this early. If you embrace the journey, think about your monitoring and logging before anything else — even if you're only doing dev or experimentation, get these building blocks in place. Because, as I was saying, there are events from Spark — the driver, the executors; it's a cluster, sometimes hundreds or thousands of nodes — and there are Kubernetes logs, and it's really hard to reconcile all of this in one place. Again, there's an amazing ecosystem here: Fluent Bit is something that really helped us a lot, and Prometheus — Spark has built-in Prometheus support. So this is something I can recommend; there is a consensus in the community about what to use, and with this building block the whole area becomes easier to control, to monitor, and to log. One of the key things: when we did this monitoring, we learned a lot of things we were not expecting. We learned about over-provisioned executors; we learned about partition issues, which are a common disease with Spark; and a lot more, all through our monitoring system. So I really recommend you shift the paradigm: monitoring is not just for firefighting when things happen — it helps you be proactive, and with the complexity of such a system you need that proactivity when you have real-time machine learning jobs running on Spark and Kubernetes. Scalability. There are three key factors to consider: cluster sizing; infrastructure choices and specs — by spec I mean the kind of VMs, or on AWS the type of EC2 instance you choose; dynamic allocation — I'll say a few words about this — and shuffle data. Shuffle, in a nutshell, if you're not familiar with Spark: since the data is distributed across workers, shuffle is the data transferred from one worker to another over the network. It's really important, and a classic pain point, in Spark. So those are the building blocks to consider. For sizing, there is no automatic way to do it — it's common sense, and don't forget the best practices from software engineering. It's a continuous, repeated exercise. At some point you need to know your data sources, so that the clusters you put in place and the instance types enhance your parallelism in relation to those sources. Parallelism means you want to process different events, different incoming data, in parallel, and your compute resources need to be geared and configured to handle the data arriving in your stream. Streaming is CPU-bound — meaning we spend most of our time on CPU rather than on input/output; we don't do a lot of reading and writing. But be careful: in our modeling we do a lot of stateful work, a lot of aggregation, so we have to choose resources that are CPU-optimized while still thinking about memory, because we keep a lot of state — especially streaming state and events — in memory. Deep learning: as I was saying, we used a lot of GPU, at least in the training phase, and you have to be careful not to burn a lot of GPU on this.
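A back-of-the-envelope way to do this sizing exercise. The per-core throughput figure is a made-up planning number — measure your own — and remember that effective parallelism is also capped by the number of source partitions:

```python
# Sketch: rough cluster sizing for a CPU-bound stream. The event rate
# and per-core throughput below are hypothetical planning numbers;
# the real exercise is continuous and repeated, as the talk says.

def executors_needed(events_per_sec, events_per_core_sec, cores_per_executor):
    """Cores needed to keep up with the stream, rounded up to whole executors."""
    cores = -(-events_per_sec // events_per_core_sec)  # ceiling division
    return -(-cores // cores_per_executor)

# Hypothetical: 50k events/s, ~4k events/s per core, 4 cores per executor
# -> 13 cores -> 4 executors.
print(executors_needed(50_000, 4_000, 4))  # → 4
```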
We have a mix of CPU and GPU on our Kubernetes clusters. As I was saying, we have what we call virtual node groups: we create affinities and taints, which help us make sure that only the relevant workloads go to the GPUs. And to finish this point: it's a mix of CPU and GPU, and we make sure we spend most of the time on CPU rather than GPU for this kind of work. Okay, dynamic allocation. This is a fairly advanced Spark topic, but in a nutshell, it does auto-scaling — the same idea as the autoscaler on Kubernetes, but it tries to do it in a smart way. For batch it works perfectly — it's a really great feature — but when it comes to Structured Streaming it becomes tricky, and it works really poorly for certain applications. You may be familiar with these Jira tickets — I've been watching one of them for, I think, the last four years. It's still open; there are a lot of committers, a vibrant community, but — just to show you the complexity — there is actually no agreement on how to dynamically allocate resources for Structured Streaming jobs. And this is independent of Kubernetes; it's a Spark problem inside Spark. When you run Spark on Kubernetes it adds even more complexity, and this feature may cause a lot of issues. So be really careful if you want to use dynamic allocation, and don't trust it blindly — there is no autopilot here. The other thing is the shuffle problem, which is partly a consequence of the dynamic allocation problem, but not only. This external shuffle topic has been around for a long time, and we have been waiting — you can see it in some of the future work: an external shuffle service.
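A sketch of what enabling dynamic allocation on Kubernetes can look like. Since there is no external shuffle service there yet, shuffle tracking has to be enabled so executors holding shuffle data aren't reclaimed; the bounds are hypothetical, and per the caveats above this is a starting point to test, not an autopilot:

```python
# Sketch: dynamic allocation conf for Spark on Kubernetes. The
# min/max bounds are hypothetical; test against your own workload,
# especially for Structured Streaming where behavior is tricky.

def dynamic_allocation_conf(min_execs=2, max_execs=20):
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_execs),
        "spark.dynamicAllocation.maxExecutors": str(max_execs),
        # Required on K8s, where no external shuffle service exists yet:
        # executors with live shuffle data are kept around.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }

conf = dynamic_allocation_conf(min_execs=2, max_execs=20)
```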
Hopefully it's coming soon. In the meantime, there are different open source frameworks that give you an external shuffle service — somewhere to store your shuffle data. You have three from AWS, one from Intel, and one from IBM. To be honest and transparent, we haven't tested the most recent ones — the S3 CSI driver and the high-performance storage one — but we have been playing with the others, trying to reach out to the creators and maintainers of the projects, and we made a rational, educated choice: we are using IBM's spark-s3-shuffle. It requires Spark 3.2 or later — a fairly recent Spark version — and it helps us tackle the external shuffle problem; whether for batch or for streaming, it works perfectly. Latency. Just a small remark here, and I'd say it's for the young generation, people who are starting out in tech. Those of us who are older used to learn the "latency numbers every programmer should know," and it's kind of fascinating that a lot of younger folks don't know this stuff at all. The basics, the fundamentals, are really important. We may talk about cutting-edge machine learning, Spark, Kubernetes — but at the end of the day it comes down to knowing your basics. So, latency. We spent a lot of time whiteboarding, and as I said, our processing goes from a Kafka topic, through Spark Structured Streaming, to a Kafka topic. This is our highway.
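As a refresher on those "latency numbers every programmer should know" — a few of the classic entries, as rough orders of magnitude (exact figures vary by hardware and year):

```python
# A subset of the classic latency numbers, in nanoseconds (rough
# orders of magnitude). Handy sanity checks when budgeting a
# streaming pipeline's end-to-end latency.

LATENCY_NS = {
    "L1 cache reference":                0.5,
    "main memory reference":             100,
    "read 4KB randomly from SSD":        150_000,
    "round trip within same datacenter": 500_000,
    "disk seek":                         10_000_000,
    "packet CA -> Netherlands -> CA":    150_000_000,
}

# One in-datacenter round trip is ~0.5 ms; a Kafka -> Spark -> Kafka
# pipeline pays several of these, which is one reason sub-second
# budgets fill up fast.
roundtrips_per_second = 1_000_000_000 / LATENCY_NS["round trip within same datacenter"]
print(roundtrips_per_second)  # → 2000.0
```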
This is our key pattern. The key things I'd recommend watching out for: the Spark configuration — whatever you learned before about Spark configuration, you'll need to tweak it here, and make sure you master what you're doing. Sub-second latency expectations are a challenge, and I can guarantee — for this project, with the amazing people I've been working with — it's really hard to get there, even though a lot of people out there claim they can reach sub-second latency. It's something really hard to achieve. And it's always a trade-off between stateful and stateless. Stateful, again: when I'm streaming and I want to keep state — say I'm doing an aggregation, counting the number of red cars passing a sensor on a highway, just to give you a sense of what we're doing. I'm counting red cars going through, so I do a lot of aggregation and I need to keep state. Stateless is when I don't really care what happened before — it's just: what color is the car I'm seeing right now? Red, black, whatever.
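The red-cars example, sketched in plain Python (no Spark involved) just to make the stateful/stateless distinction concrete:

```python
# Stateless vs stateful over a tiny simulated event stream.
from collections import Counter

events = [
    {"car": 1, "color": "red"},
    {"car": 2, "color": "black"},
    {"car": 3, "color": "red"},
]

# Stateless: a per-event decision; nothing is remembered between events.
flags = [e["color"] == "red" for e in events]

# Stateful: a running aggregate (count per color) carried across events.
state = Counter()
for e in events:
    state[e["color"]] += 1

print(flags)         # [True, False, True]
print(state["red"])  # 2
```

In Structured Streaming, that running counter is the kind of thing the state store holds between micro-batches — which is exactly where memory pressure comes from.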
So again, this has to do with design and with the problem you want to solve — it's a really important thing to take into consideration. Another thing we learned: keep the computation as simple as possible when doing your data transformation or enrichment. If you do complex joins and the like, you won't get anywhere near a reasonable latency. Be aware of this — it's common sense, but it's worth remembering that common sense still works pretty well when you're dealing with these complex systems. And this pattern: always use a message bus, like Kafka, before and after. In our case it's Kafka–Spark–Kafka, and we have another pattern, Kafka–Spark–Cassandra. Always do this. We've seen configurations where, supposedly, by removing these layers — these middlemen — you improve the latency. It doesn't work that way: by removing something like Kafka, you lose a lot of functionality in your platform. And RocksDB — it's a newer feature in Spark that stores all the state, especially for the stateful streaming pipelines we mentioned. It keeps the state out of the heap, so we don't end up with JVM out-of-memory errors and other issues. ML training. This is really interesting: a lot of ML frameworks — especially if you're coming from the single-machine world, just running Python on one machine — are disturbing at first, because they were designed for a single-node environment. Spark is the opposite, as I was describing.
It's a cluster: when you have multiple machines, you have to distribute and parallelize your workload, and that's something you need to think about before you choose a framework. Spark MLlib, the native machine learning library in Spark, is lagging behind. Hopefully I'm not at an Apache Spark conference, or I'd probably get people screaming at me — and yes, we should probably contribute to it — but I'll say it intentionally: it's getting a bit obsolete. There are things like TorchDistributor, which was included recently, so there is still life out there, but the modern frameworks are not in MLlib. So what we did is use the major frameworks — TensorFlow, Keras, and PyTorch — plus things like Horovod and TensorFlow Extended. And then — is anyone from NVIDIA in the room? No? Okay, I'm safe — there's an open source tool that's pretty amazing and unified for ETL and machine learning. I was asking a friend recently, I don't understand how GPUs can accelerate ETL and ELT — but it works, and it works pretty well. That's Spark RAPIDS, from NVIDIA.
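On the TorchDistributor point: a hedged sketch of what launching distributed training from Spark can look like, assuming Spark 3.4+ where TorchDistributor ships with PySpark. The worker count and the linear learning-rate-scaling heuristic below are illustrative assumptions, not the setup from this talk:

```python
# Sketch: distributed PyTorch training driven from Spark via
# TorchDistributor (PySpark 3.4+). Hyperparameter values are
# hypothetical; adapt to your own training function.

def scaled_lr(base_lr, num_workers):
    """Common heuristic: scale learning rate linearly with data-parallel workers."""
    return base_lr * num_workers

def run_distributed(train_fn, num_workers=4, use_gpu=True):
    # Imported lazily so this file parses without a Spark runtime.
    from pyspark.ml.torch.distributor import TorchDistributor
    return TorchDistributor(
        num_processes=num_workers,
        local_mode=False,   # run on the cluster, not only on the driver
        use_gpu=use_gpu,    # route onto the GPU node group
    ).run(train_fn, scaled_lr(1e-3, num_workers))

# run_distributed(my_train_fn) would then be submitted as a Spark job.
```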
We have a really nice framework there that we used intensively — though it's really hard to run outside NVIDIA hardware. We do a lot of open source and we have this philosophy of sharing, of running on any commodity hardware, so this part is a bit complicated, I'd say. But we've been using these frameworks, they are pretty easy, and we publish a lot of Docker images where we package everything together — a lot of companies use, not our product, but at least our Docker images, especially with the NVIDIA Spark RAPIDS accelerator. Okay, one more thing: Spark on Kubernetes is great, but for batch — and even if you're doing real-time machine learning, at some point you will have to do some batch — the default Kubernetes scheduler lacks specific scheduling strategies. You end up with a lot of job starvation, and performance deteriorates. The solution: there are two major projects here, Volcano and Apache YuniKorn. Volcano is in the CNCF; it's growing, there's a vibrant community contributing to it, and it's really good. As a custom scheduler it gives you job priority and preemption, zone awareness — and latency matters when you want to co-locate your workloads in the same data center — plus, especially with Volcano, things like task topology and gang scheduling. For deployment, this is an area where we have no guarantees yet — we didn't benchmark anything — but, just as mlops.org says: deploy your model. Yes.
Package your model as a Docker image with all its dependencies and push it. We are using Kubeflow and KServe to do this, but it's not easy — you end up with huge Docker images. So we are still thinking about model optimization and model compression; they're really important. And we are experimenting with ONNX Runtime, an open source project from Microsoft — I don't think it belongs to any foundation, but it's still maintained by Microsoft — to manage all the dependencies. So with all this, we achieved the platform I was showing, meeting those criteria around latency and scalability, with a monitored platform, in order to build and tackle this real-time machine learning. Let me conclude quickly. I'll try to convince you one more time: if you are starting your journey, if you want to experiment at some point, Spark on Kubernetes is the thing to go to. Native integration — all the best practices and benefits that apply to Kubernetes apply to Spark for free. We could achieve scalability, latency, fault tolerance, model training, and serving — with some caveats on the serving — and integration with the ecosystem. Key takeaways: as I said, use the operator, and if there are any contributors here, please contribute. Monitoring is really important. Use the Spark best practices that are compatible with Kubernetes and its rich ecosystem. And contribute to the OSS. There was a talk on Thursday saying that open source is not only code, and I can confirm it: open source is not only code. Share everything you have; share your experiences. Write a blog on Medium and say, "I struggled to do what this guy was talking about for an hour — it doesn't make any sense." Share your code and ideas on GitHub; share your challenges.
This will help the community, and this will help you grow. And try to solve real-life problems, because remember, my problem was to be healthy, to stop using my car, and maybe to save the planet if I can. Sorry. Technology today can recommend the best movie or the best food in real time, the next food or the next movie I have to watch, but not, as I said, solve real-life problems that may affect all of us. That's all for me. Keep cycling when you can.

All right, are we ready for questions?

Hi, great talk. On one of your slides you had Kafka going to Spark going to Kafka again, with RocksDB in the middle. Why not just use KTables or ksqlDB? Why add Spark if you're already in a Kafka infrastructure?

So, we have this common pattern, and I should have justified it. The reason is that we read from topics where open data comes in in real time, so we need a topic in Kafka. Why do we put Spark Streaming in the middle? Because we do a lot of complex transformations, so we need something like Spark that is distributed and robust, and we do a lot of parallel joins across different streams. Then we write back to another topic, and that topic is used for feature extraction. Did I answer your question?

That's what ksqlDB is for.

Again, this is something we should probably consider, and remove stuff like this, but we definitely needed, at some point, something that can scale and is, I would say, distributed enough to do this kind of thing.
I'll definitely have a look at ksqlDB. Thank you. More questions? One up here.

Regarding monitoring a Spark Structured Streaming job: there are two ways you can monitor it, writing an asynchronous listener or a synchronous listener. But even if you write an asynchronous listener, you still only get the logs. One of our struggles is: how do you get insight out of the logs? Do you go through Prometheus or Grafana? Have you tried that kind of stuff?

Yeah, that's a good point. We use Prometheus directly on the workers to collect the logs, and we try to reduce all the noise. It is a learning process: we target the few metrics that matter to us, we make sure that metrics A and B are there, and on the fly we parse and reduce the logs we collect from the workers. We do this with Prometheus. So yes, we use these tools, but there is some intelligence, some work, to make sure we are not keeping everything; we parse and process it, and we collect from the source only the metrics we are looking for. This adds some latency, because it is a process, a thread running, so there is some overhead, but it helps us collect the right information, especially when we do streaming.

Did you find an optimal bicycle route?

Not yet, we're still working on that. It's still a struggle, but hopefully I'm still cycling. As I said, sometimes it's more human learning than machine learning; maybe I learn more about avoiding traffic than any system can help me.

All right, great. Thank you, a round of applause. Thank you very much.

Hello. Hello. Perfect. Yeah, hello, hello. It's okay, right?
Just don't look at the wall over there. Perfect. How many of you, is this your first talk of the day? Nobody? Okay, good, because it's also the last one of the day. We'll give it about one more minute. Who traveled the farthest to be here today? Everybody's SoCal locals? I didn't catch it. Panama? Were you in a talk earlier today? Okay, Panama. Anybody beat Panama? All right, Panama it is; that's the winner for this talk. Okay, it's 6:15. This is the last talk of the day, Saturday, at SCALE 21x: "Discovering Business Insights with Open Source Machine Learning." A round of applause, please, for Surya. Surya, take it away.

Thank you very much. First of all, I would like to thank all of you for making it to the last talk of the day; I really appreciate that you all made it. To begin with, today I'm going to talk about discovering business insights with open source machine learning tools. It's all about predicting customer behavior for the growth of your business. One thing I want to say: in this age of LLMs, foundation models, generative AI, whatever you call it, most people are trying to solve their use cases with LLMs, generative AI, and large foundation models. It's true that they are efficient and effective in some ways, but they come with their own set of costs, and that constitutes a whole talk in itself for some other time. Today I'm taking a step back and exploring some simpler models, simpler time series forecasting models, which can solve some useful use cases for businesses.

I would like to give a shout-out to my colleague Hema Viradi, a data scientist at Red Hat, who was an equal contributor to this project. To introduce myself: I'm Surya Prakash Pathak, a data scientist at Red Hat, based in Berkeley, California. You can reach out to me by email, GitHub, or LinkedIn, whatever works; happy to connect with you all. So, on to today's agenda.
I'll be talking about why customer growth is so important for a business, then go through data sources, time series models, and model insights, talk briefly about a model monitoring tool, and wrap up with a demo.

The main goal of this project was to predict customer behavior over time for Red Hat internal products: predicting customer potential for different customers across different Red Hat products. The data was private, so I cannot share it here, but the model is open source, and it's adaptive in nature. Although I frame this in terms of business insights and customers, you can adapt this model to any other organization. Say you have a GitHub project with contributors, and you want to track those contributors over time and identify potential contributors for your projects; you can use this same framework.

For this particular talk, we generated data using OpenAI credentials, just for that. That's the only LLM we used, nothing more. After that, we employed a time series forecasting model, got forecasted data, and using that forecasted data we classified our customer base (or, for other use cases, a contributor base) into high, medium, and low potential zones.

One reason predicting customer potential is so important is that it gives you an advantage in choosing which direction to drive your business. For example, in terms of GitHub projects and contributors: if you can predict your customer behavior over time, it helps you manage your investment in your projects, and it helps you manage your allocations as well, so you can manage your resources, your investment, and your allocations for your projects, if you know when your contributor is active and when your contributor is not
active. Or, if you can determine that these are high-potential contributors you should not lose, you can reach out to them and give them offers or good service so that they stay with your organization. Or you can track low-potential contributors and offer help and suggestions so that they move from the low potential zone to the high potential zone. In this way you can manage your business or any kind of project.

For time series modeling we are using ARIMA to predict customer behavior over time: basically, how much a customer spends every month, predicted with ARIMA time series forecasting models.

Let's move on to the datasets we used. This is the simplest, idealized dataset, generated using LLMs. Here we have different accounts. One important thing to understand about this kind of dataset is that it is not one time series; it is multiple time series. Say you have 50 customers or contributors, or 200: then you have 200 time series, and for each time series you need to make a prediction, and from those predictions you need to classify the customers based on their predicted data. So here we have different account names (account name 1, account name 11), different dates, in this case varying weekly, and the actual value, which you can correlate with contributor activity over time or with revenue over time. This is the predicted value from the ARIMA time series forecasting model, and this is the MAPE score. Now, what is the MAPE score?
MAPE is basically the mean absolute percentage error: the percentage error between the actual and the predicted values. It is a good metric for evaluating time series forecasting models. If the actual and predicted values are equal, MAPE is zero percent, perfectly accurate, which is the ideal case. The higher the MAPE, the less accurate your model. Generally, a MAPE from 0 to 10 percent is considered accurate; from 10 to 20 or 30 percent, moderately accurate; anything above that is generally considered inaccurate, and you need to change your model, add features, do some data preprocessing and so on, then train and predict again. That's the cycle of making your MAPE as accurate as possible. So MAPE is a good metric for checking whether your model is working or not.

We talked about the dataset; now let's talk about time series forecasting models. Here I'm highlighting two basic, traditional time series forecasting models, both open source. Time series forecasting models take data points from the past and, based on that past trend, predict the future. First there is ARIMA, a family of statistical models that caters to the standard structure of time series data; it is based on the underlying correlations. One interesting thing about ARIMA is that for every time series you need to define the hyperparameters to run the model, based on autocorrelations of the past values and moving averages of the past values. Then there is Prophet, which is somewhat similar to ARIMA but is more of an additive model and tries to detect change points; it's more like a curve-fitting model. It was developed by Meta, and it's an open source model.
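Returning to the MAPE metric defined above: a minimal sketch of the computation and the rough accuracy bands from the talk. The function names are mine, not from the talk's notebooks.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent, between two equal-length
    series. Points where the actual value is zero are skipped to avoid
    division by zero."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

def accuracy_band(mape_pct):
    """Rough bands from the talk: under 10% accurate, 10-30% moderately
    accurate, anything above that inaccurate."""
    if mape_pct < 10:
        return "accurate"
    if mape_pct < 30:
        return "moderately accurate"
    return "inaccurate"
```

Identical actual and predicted series give a MAPE of 0.0, the ideal case; the roughly 5 percent reported later in the talk would fall in the "accurate" band.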
You can use Prophet, and one good thing about it is that it's easy to use with default hyperparameters; you don't need to define hyperparameters on each run, it's automatic. And, based on personal experience, I felt that Prophet works much better when your data is seasonal: if your data has some cycle every season, Prophet works pretty well. With ARIMA you need to define your own hyperparameters, but it worked pretty well for us because we were dealing with a number of time series, for example 50 or 100, and in order to run across all of them we chose ARIMA.

Now, I told you we have multiple accounts and multiple time series, but with ARIMA the catch is how to define hyperparameters for each one of them. That is resolved easily with the auto_arima function from the pmdarima package: auto_arima determines the hyperparameters automatically based on the individual time series data.

In this curve you can see the training points given by the blue line, the orange points are the test set, and the green dashed line toward the top corner is the forecasted points. So,
how we run our model is this: we train the model, hold some data out as a test set, and check through MAPE whether the model is working well or not. In this plot we calculated a MAPE of 5.1 percent, which is pretty accurate for our use cases. I keep saying this MAPE is pretty accurate, but that depends on the kind of use case you are in. Sometimes your data is so sensitive that even a 0.1 or 0.2 percentage-point difference matters a lot, and in that case you need a very, very accurate MAPE; in other cases some flexibility is fine. So here we have the training set, the testing set, and the forecasted set, these green points. Based on these green points, this forecasted data, we classify the customer set into different zones: high potential, medium potential, and low potential.

Moving on: one way we calculate the trend is by computing the slope of the line through the data points. If we consider slope as the trendsetter, then for an increasing trend you generally get a high slope; if your data points are not changing value over time, that generally falls under a medium, roughly constant slope; and if your data points are decreasing over time along the x-axis, those slopes are generally negative and are considered low-growth slopes. Based on these three classes of slope, we classified our customer base into low-growth, medium-growth, and high-growth candidates.

You can do the same thing for your GitHub contributors list: track contributor activity over time, and use the forecasted data points. Say you want to forecast contributor activity for next month, and for those four weeks you see an
increasing trend; then those particular customers or contributors generally fall into the high-contributing zone for the next month. That's the catch. If I see contribution activity that is constant over time, those contributors can be regarded as medium-potential contributors.

Let's see some examples. This is the case of a low-potential customer, or a low-potential contributor; we can read it that way as well. Here we see up-and-down phases for this customer: sometimes high, sometimes low. But if we look at the forecasted green points, the forecast for the next five weeks after the testing phase, what I'm seeing is that for the next five weeks this contributor is on the lower side; they will not be contributing enough. That's what I'm predicting, and that's why I've segmented them into the low category.

Now, we should not take this kind of result as the ultimate truth, but it can be a key indicator to inform the customer success team, or whichever team is in conversation with the customers or contributors. If we see indicators like this, for example account name 45 in the low category, we can inform the customer success team that such-and-such customer or contributor is in the low-category zone. Is there any problem with them?
They can then check with their customer or contributor and talk about whether there is any issue; it may just be a false-positive signal. Similarly, the other example shows the high category of customer. This customer or contributor shows an increasing trend over the period, and an increasing trend is predicted for the next five weeks as well. That, too, is a key indicator, and it calls for further investigation from the customer success team: if a contributor is performing really well, or a customer is spending well on your product, you can give them offers or perks so that they stay and perform even better. The main aim is to keep the customer or contributor engaged in your project or your business. These are ways you can track those contributors or customers and start the conversation if there is any problem; this kind of plot can be a first-hand key indicator for checking in with them.

Now, as I just said (going back a little), this is the forecast for the next five weeks. After those five weeks you will need to run the model again, and after the next five weeks again, so that you constantly get the signal of which customers or contributors are in the lower zone and which are in the higher zone. It is not a static model where you run it once and everything is resolved; it is a dynamic model that you run at every interval, every month, quarter, six months, or year, and then you reanalyze.

Yes, sir? Yes. So when we rerun our model, we check our evaluation metrics again. And that brings me to the next point: say I need to run my model for 12 months in a year.
What do I do? I need to check the evaluation metric value at each point to make sure my model is working well, because of data drift: the data might change at some point, which can affect the model. Because of those changes there might be a feature that changes or a feature that goes missing. If you are running a model at regular intervals, this kind of model monitoring will help you track your model's progress.

MLflow is an open source platform for managing the ML lifecycle, including model monitoring. MLflow provides a variety of tools for it: you can track whatever metrics you want (you log them and they are recorded automatically), you can visualize model performance, and you can set up alerts when model performance changes. For example, you can set up an alert on your MAPE values: if a run's MAPE score is above 10 percent, send me an alert. You can set up that kind of thing with MLflow.

Here's the UI. This is the MLflow we deployed on OpenShift at Red Hat, and here's a simple example of how we track these models. This is the experiment, "customer growth model," as I named it, and these are two runs I have made. This is the first run, and here is the metric, the MAPE value I logged: this is what I calculated, what we got from running the model. If I go to the other run, there's another metric value, about 5.3. Between the two runs, two things changed. One is the MAPE score.
That is not the fault of the model; I just changed the unit from a fraction to a percentage. Generally the MAPE score as a fraction runs from 0 to 1, so a value like 0.05 is equivalent to 5 percent. Only the unit changed between these two cases.

Now, say you have 10 different runs here; how should you visualize them? You can visualize by creating some kind of chart, say a bar chart. If I add that, in the first run I have this 5.34 percent MAPE, and in the second run I had 0.05 MAPE; these are different values of MAPE. If, for example, you have 10 accounts or 100 runs, you can also visualize whether the MAPE has surpassed the threshold value or not. So these are some of the things we can look into through MLflow.

The next thing I want to go through is this repository, the customer growth model, our project repository. The code behind all the results I have shown you today is already in there. The two main folders of the repository are data and notebooks. In the data folder you have the raw data, which was generated using LLMs, and in the processed data you have the prediction dataset and the labeled dataset, which I can show you along the way.
So if you go to the notebook section, the first notebook is "growth model generate data." Here you can see how we used an OpenAI API key to generate a tabular dataset for this talk and this model. You need to supply the API key as an environment variable and give it a prompt. The kind of prompt I gave was: make sure there is an account number, and make sure the dates are between January and July, just random. Then I made sure that a couple of accounts have increasing amounts over time, decreasing amounts over time, and medium amounts over time, so that I had some variety to experiment with. That's the kind of prompt I gave the model. So I loop through each account, making a request with the OpenAI credentials, go through some data preprocessing steps, and ultimately get a dataset like this: you have the account name (account 1), different dates varying weekly, and the revenue. This is the data I save in the data folder.

Once you generate this dataset, we move on to the model forecasting notebook. There you import all the packages for time series forecasting: the ARIMA package, which comes from pmdarima, and the mean absolute percentage error, which is already available as a package. You have this dataset, but in real-world situations you might need to go through a lot of data preprocessing. I want to give a shout-out to the talk that happened earlier today about the chaos and tsunami of GitHub data, which is so enormous that it's very difficult even to manage it and make sure it's all there, right?
So that is a challenge we go through, and I've skipped that step here completely; I've just shown you what happens after it. To get to this level of data we need some data preprocessing, we need feature correlations, and we need to understand feature importance (which features matter for this model) and select accordingly.

Here is one example where we create our own feature, called tenure. We didn't have this feature in our dataset, so we created it: we assume tenure accrues whenever there is a non-zero value. If a customer spends in a given month, that month counts as plus one tenure; if the customer does not spend that month, it counts as zero. The tenure is summed over the rows, so it can act as a feature for our dataset.

Then there are two stages. First, train, test, and evaluate; then, once we have evaluated the model (since it's a time series), we train the model on the whole dataset and forecast. In the first phase we have the actual value and the predicted value for each account: account 1, and every other account, each with actual and predicted values. Once we have those, we calculate the MAPE, and here I have highlighted what counts as an accurate MAPE and what's unacceptable. This is the MAPE calculation part: we have the actual, predicted, and MAPE values, and we save all those results in the data folder again. So now we have a decent MAPE of about 5 percent, which is pretty accurate for our model, and this 5 percent is the median MAPE. We can see the MAPE for every account, right? So how do you calculate a single score for all of them? How do you represent MAPE with a single score?
To get an average value, we calculate the median. So here we have the model forecasting algorithm, and then we forecast the data for future dates, not the current dates. Once we have the account names and the forecasted values, our next step is to take this data forward and segment the customers into different zones. For segmentation we calculated the slopes of the forecasted values; that is what we have done here. We labeled the customer potentials based on the forecasted data: we have the accounts, dates, and forecasted values, we calculate the slope, and based on the level of slope we label them, defining a threshold for each category. If we consider the different values of slope from minimum to maximum, we take the minimum as the 0th percentile, somewhere in the middle as the 50th percentile, and the maximum as the 100th percentile. We classify slopes such that whichever account has a slope below the 30th percentile is taken as a low-potential account, and accounts whose slopes lie above the 75th percentile are considered high-potential accounts. This is completely based on our assumption; you can have your own assumption. Sometimes you might decide that above the 60th percentile is a high account. It's completely flexible, and depending on your use-case scenario you can define your own thresholds. Given those thresholds, I classified the accounts into high, low, and medium.
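The slope-plus-percentile labeling just described can be sketched in plain Python. The nearest-rank percentile and the example forecasts below are my own simplifications; the talk's notebook may compute both differently.

```python
def slope(values):
    """Least-squares slope of a series against its index: the 'trendsetter'
    used to classify the forecasted points."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def percentile(values, pct):
    """Nearest-rank percentile (0-100) over a list of values."""
    ordered = sorted(values)
    return ordered[round(pct / 100 * (len(ordered) - 1))]

def label_accounts(forecasts, low_pct=30, high_pct=75):
    """Label each account low/medium/high by where the slope of its
    forecasted points falls against the chosen percentile thresholds."""
    slopes = {account: slope(points) for account, points in forecasts.items()}
    low_cut = percentile(list(slopes.values()), low_pct)
    high_cut = percentile(list(slopes.values()), high_pct)
    return {
        account: "low" if s < low_cut else "high" if s > high_cut else "medium"
        for account, s in slopes.items()
    }

# Made-up five-week forecasts for five hypothetical accounts.
forecasts = {
    "acct_a": [10, 8, 6, 4, 2],    # falling
    "acct_b": [5, 5, 5, 5, 5],     # flat
    "acct_c": [5, 6, 5, 6, 5],     # roughly flat
    "acct_d": [4, 5, 6, 7, 8],     # rising
    "acct_e": [2, 6, 10, 14, 18],  # rising fast
}
labels = label_accounts(forecasts)
```

With the 30th/75th percentile cuts from the talk, the falling account lands in "low", the steeply rising one in "high", and the rest in "medium".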
So here you have the classification and everything. Once we have these classification results, we can send them to the customer success team, and they can identify high, low, and medium, and recheck with their customers and contributors.

Next, just to visualize and put things in perspective, we plot a couple of accounts. For example, this is a high account, as we saw earlier: an increasing kind of slope. The next one, although it shows a somewhat constant slope over the whole range, shows lower values for the next five weeks, which means it is a low account for the next five weeks. In the third case, although it is also similar to the one above, for the next five weeks it shows both minimums and maximums: it goes lower and it goes higher, so overall, for the next five weeks, it shows a roughly constant slope. If you take an average of all of them, it falls somewhere in the middle; it doesn't change. So we take this kind of account as a medium account, where there is more or less a constant slope.

Once we have these visualizations and all these results, we can log the MAPE values in MLflow. Since our MLflow is deployed on OpenShift, we need credentials to access it, so we supply the MLflow URL and the MLflow tracking token as environment variables. Once we have that, we can log the calculated MAPE value, 5.34, to MLflow. And if I put this into a pipeline, then on every round, once it runs this notebook, it will automatically log the MAPE values to that MLflow experiment. This is how the whole process happens. This is how we can check the model.
This is how we can run the model, and run it at every interval. Having said that, I'll go back to my slides now; the demo is done.

One of the main takeaways from this talk is that it's not just for a business use case; it can be used for any kind of time series use case. The main thing you need is a time series. The other thing you need is to know what you want to achieve, asking questions like: for your GitHub projects and contributors, what do you want to see, what do you want to track? For example, we could consider metrics like interaction frequency: how often you interacted with a GitHub member or contributor, and whether that interaction helped the contributor's growth or not. Those kinds of features can be added later. And there are definitely challenges with GitHub and with this kind of customer interaction, because there are a lot of features that are very hard to extract. If you are running an organization on GitHub and you converse with contributors every now and then, is there a method to record those conversations as a metric in a table and add them as features to this kind of model? If we can add those features, I think the MAPE will be much more accurate in that case.

So yeah, thank you. Here is the customer growth model GitHub repo, and the links to the notebooks are here. You can definitely check them out, and if you have any time series, you can apply these models to your own data. If you run into an issue, just open an issue in the repository and we can have a conversation there as well. Thank you very much, and let me know if you have any questions.

Do we have any questions?

Hi, thank you for the talk.
I'm actually really into time series and stuff like that, so I do have some questions; it's actually my area of expertise. When you compared ARIMA to Prophet, you mentioned that Prophet is better at seasonality, but ARIMA never had to take seasonality into account because there's SARIMA for that.

Yes, exactly.

And you also mentioned creating features, as in a new feature to add into the time series. As far as I know, regular ARIMA doesn't take features unless you're using ARIMAX.

So, your first question: you're saying that SARIMA does the seasonality, right? One reason we are not addressing seasonality here is that our accounts are multiple. Say one account has a seasonal basis and another account does not; to make sure everything works for all of them, we run one model. We cannot run SARIMA for one and ARIMA for another, so it's a bit tricky there. Just to keep the matter simple, we cut out the seasonal part. But with auto_arima from the pmdarima package you can adjust the seasonality as well: if you call the auto_arima function, there is a seasonal argument you can turn on and off. The other part of your question was about features, right? That's a tricky part in time series.
Again, in the pmdarima package you can add features as exogenous variables, inside the arguments themselves. But the condition is this: if you are predicting the future, you may not have values there for the features or the target variable, and that complicates things. Sometimes you don't have the values for the target variable, but you do have future values for the feature variables, right? In that case, while predicting, you can supply the feature values as exogenous variables in the pmdarima package.

Yeah, I think it looks simple; it's like one direct statement, you just pass the arguments and fill in the boxes, but most of the work goes on in the background, I think.

Yep, please take a look. Any other questions? All right, I think we had a pretty good group for this late in the evening. Thank you.

Great, thank you, a round of applause, please. Thank you very much.