And now we come to the keynote by Claudia Comito. I'd like to introduce Claudia. Claudia is working in Jülich, and she is going to tell us about the building of the Helmholtz Analytics Framework and the making of Heat. Now, I don't really know a lot about what these things mean; I hope you're going to tell us. But it sounds very interesting, and Jülich is a supercomputing center in Germany, so it definitely has a lot to do with high performance computing, big data, and all these nice things that everyone talks about. So Claudia, welcome to EuroPython, thank you for giving the keynote, and I think we can now go ahead: take it away. Let's share the screen. Right, and I'll go off stage now. I'll stick around, take notes of the questions, and then we can have the Q&A five minutes before the end. All right, thank you. Okay, thanks a lot for the introduction, and thank you so much for inviting me. I really cannot put into words how astonishing it is for me to be talking at EuroPython. My background is actually in astrophysics; I only switched to scientific computing three years ago. The last thing I expected was to be invited to EuroPython. Anyway, I'm glad you don't know much, or probably nobody knows much, about these projects, because yes, I'm going to tell you about them. What's interesting is, and I didn't quite realize it at the time, but this project I joined in 2018 was a pretty impressive exercise in getting researchers to talk to each other. Those of you in academia know that it can be difficult to get researchers to talk to each other, even if they're working on similar topics in the same building. In this case, the goal of the project was to get completely different research groups in different scientific fields to work on, and eventually adopt, a common product. So I'm going to tell you about this.
In general, I think the project was quite successful, but sometimes I realize that some of the things that made the project a success could easily have gone completely the other way. So I'm going to try and take these things apart a bit and look at the many ways in which we were good, and some of the ways in which we were lucky. Okay, let's get started. Of course, you can't escape this: interwoven with all of it is my personal journey out of fundamental research and into data science, and not right after my PhD or my first postdoc, as most researchers do, but at a relatively senior age. So let me say a few words about myself. First of all, I'm originally from Italy. I studied astronomy in Bologna and moved to Germany about 20 years ago to write my PhD thesis on the molecular content of star-forming regions. As things turned out, I never left Germany; in fact, I've been a German citizen for over 10 years now. All things considered, I worked in astrophysics for about 18 years, including my PhD: at first with the high-altitude telescopes of the Max Planck Institute for Radio Astronomy in Bonn, and later with Herschel Space Observatory data at the University of Cologne. It wasn't very obvious to me at the time, but I think I was already more interested in the technicalities of data acquisition and processing than in the analysis itself, or in doing the research. So I often found myself in interface positions between the technical people, as we used to call them, and the astronomers, or the users, as the technical people called them. And of course, I had been using Python and NumPy for data calibration and data analysis for all of my career as an astronomer. So in 2017, I thought I was quite good at it. In hindsight, I am so glad I had no idea how much I didn't know before applying for this position at the Jülich Supercomputing Centre.
What I didn't know in my last years in fundamental research was that I had been living in a pretty unhealthy combination of bore-out and job uncertainty for a bit too long. I didn't come to realize the connection between those two things, boredom and uncertainty, until later, and it's only recently that I've started to think of projects or tasks in terms of boredom versus perceived safety. A good place, where everybody tends to be or would like to be, is low boredom and high safety. Of course, safety and boredom mean different things to different people, and actually this plot would need a time axis as well: even the same person, going through different life phases, will have a very inhomogeneous perception of what it means to feel safe. In my twenties and early thirties, not being bored meant being able to tinker with telescopes in some oxygen-deprived mountain desert, and being safe meant taking the oxygen bottle up the stairs even if it weighed a ton. I think it is a widespread phenomenon, at least from my Eurocentric perspective, that we tend to associate job safety with boredom when we are younger. But later in my career, I realized that, at least for me, boredom on the job was actually inversely correlated with safety. The less job safety I had, the more I felt pressure to take on tedious, resource-intensive tasks that I thought would make me indispensable and give me job safety. But it doesn't work that way. So I guess what I'm trying to say, in so many words, is: if you ever find yourself in that upper left corner of the boredom-to-safety plot, feel free to start thinking about an exit strategy. I was lucky: my exit strategy materialized as a job offer as a data analyst at the Jülich Supercomputing Centre in late 2017, and I was hired in 2018 as a generalist, technical-user-representative kind of person within the Helmholtz Analytics Framework.
You can already tell that at the time I didn't have much idea what I was getting myself into. I'm not sure how many of you are versed in the German publicly funded research landscape, so let me tell you a bit about the Helmholtz Association. It's the largest research organization in Germany, mostly focused on engineering and life sciences, and of course computing: high performance computing, big science, big data as research infrastructure. It's the glue that keeps it all together. The interesting thing I found is that the Helmholtz Association doesn't fund single research institutes; they focus on cross-center research programs, so large cross-discipline projects are already the norm within the Helmholtz Association. What was new in 2016 was a call for proposals for the so-called Information and Data Science Incubator, initiated explicitly to fund outstanding research projects in the general field of scientific big data, or big science, as you want to call it. The Helmholtz Analytics Framework was one of the six projects funded in the first incubator round in 2017. The focus was on tackling big data challenges in applied science and life sciences, and on providing a generalized toolkit for data analytics that would then, within this framework, be co-designed by scientists and developers. So here's a quick panorama of the actual use cases. I'm not going to go into the details, simply because I'm not a domain scientist in this project, so I only understand the use cases very superficially. We had an earth system science use case: they are working on modeling the interconnection between landmass, water bodies and atmosphere, and, like every other use case here, dealing with massive simulation volumes and massive data sets. We had atmospheric science, modeling the effect of climate change on stratospheric ozone. And we had two neuroscience use cases.
The first one was more imaging-focused, trying to disentangle genetics from outside-world effects on the human brain. We had an aerospace, rocket-science kind of use case; this picture represents aircraft design via high-precision simulations. We had structural biology, trying to identify protein structure by machine learning based on very few bits of information. And then the second neuroscience use case, which uses statistical analysis of huge time series of neuronal firing to figure out which series of triggers actually correspond to real connections in the brain. Anyway, the problem is common: the data are very large, they're huge, and they need high performance computing. The use cases are also using similar methods for dimensionality reduction. That's where our tool Heat comes in. The idea behind this, in very simplified terms: let's put together a few compute-heavy research projects from different fields and assign them a single group of scientific developers. They call us information experts; I had to get used to that. Then see if the mix can come up with a general toolkit that can serve first the use cases and then, later, the academic data science computing community at large. The buzzword here is co-design. I'm not going to get into the details of how the several research teams were selected, you can imagine there was some politics involved, but the fact is that, as you have seen in the previous slide, the final selection of research topics was quite a bit more heterogeneous than anticipated. All of these use cases had three things in common: they mostly dealt with monolithic multi-dimensional arrays; their arrays were too large to be loaded into a single NumPy ndarray; and yes, of course there was agreement on the programming language of the toolkit, otherwise I wouldn't be here, I guess. But otherwise, some of the use cases only, quote unquote "only", needed to be able to perform distributed basic operations.
So maybe they already have a NumPy-based tool, for example, that's mostly calling linear algebra, statistics, or manipulation operations, and they want to be able to run this tool on memory-distributed arrays, which you cannot do with NumPy, on big clusters, including on GPUs. Most of the use cases adopt common machine learning algorithms for labeling, like scikit-learn-style k-means, k-nearest neighbors, and so on. But some of them needed something a bit more complex, like singular value decomposition or principal component analysis; and I mean more complex at least from a parallel-implementation point of view. Finally, a subset of use cases, especially those dealing with medical imaging, involve neural networks. So, fast forward three years, and now you know why Heat, our Python library, looks a bit like a distributed tensor library with an identity crisis. There's a basic-operations, NumPy-like part; there's a parallel machine learning part that's more like scikit-learn; and there's a data-parallel neural networks part, more like PyTorch, that you can distribute PyTorch neural network training with. I'm going to go very briefly into the basic operations part of Heat, because it's the foundation. As I said, Heat adheres as strictly as possible, at least in the basic operations part, to the NumPy API. The greatest difference is that when you wrap your data in a Heat array, and we call it a DNDarray, for distributed n-dimensional array, you can specify the split argument, which defines the axis along which your array will be distributed across the available processes. Each process will only load a slice of the data, so you don't have to worry about single-node memory as long as you have enough processes to load your data on. Here's a bit more of a visual representation of what the split argument really means.
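If you want to picture what the split argument implies, here is a rough, single-machine sketch of the chunking logic in plain NumPy. This is not Heat's actual internals, and `split_slices` is a hypothetical helper written for this illustration; it just shows how a "global" array is sliced along one axis so that each process holds only its own block.

```python
import numpy as np

def split_slices(global_shape, split, n_procs):
    """Compute each process's (start, stop) slice of an array split along one axis."""
    length = global_shape[split]
    chunk = -(-length // n_procs)  # ceiling division: size of each local block
    slices = []
    for rank in range(n_procs):
        start = min(rank * chunk, length)
        stop = min(start + chunk, length)
        slices.append((start, stop))
    return slices

# An 8x4 "global" array distributed along axis 0 over 4 simulated processes:
global_data = np.arange(32).reshape(8, 4)
for rank, (start, stop) in enumerate(split_slices(global_data.shape, 0, 4)):
    local = global_data[start:stop]  # each process would only ever load this slice
    print(rank, local.shape)         # every rank holds a (2, 4) block
```

In Heat itself you would simply pass `split=0` when creating the array, and the library does this partitioning for you across the actual MPI processes.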
Locally, the local data objects are PyTorch tensors, so process-local operations are PyTorch operations, and that covers process-local efficiency. Thank you so much, PyTorch. Eager execution covers GPU support and differentiability of the arrays. But the main thing with Heat is that the DNDarray is not treated as a set of separate slices; it's treated as a monolithic entity by Heat operations. So whatever Heat function you're calling, provided we have implemented it, but we have a lot of them, is parallelized under the hood. The needed communication layer is MPI, via mpi4py, and providing this parallelization under the hood is basically what the Heat developers' job has been. I'm not going into more details; you can check out the library on PyPI and GitHub. I wanted to talk more about the difficulties, and the good luck, really, that we had within the general framework. The biggest hurdle we encountered was for sure hiring developers for an academic research project. This is really quite difficult; in fact, the Heat dev team is to this day permanently understaffed. And from what I'm hearing from other, similar research projects we are in touch with, pretty much every scientific development project runs into the same problems. You can hire scientists-turned-developers, who are usually, at least in the beginning, slow developers; or you can try to hire professional developers, but first of all they're not going to be willing to be underpaid in academia for more than a few months, maybe a year, and second, you then need some intermediate figure, or several figures, to interface the developers' brains with the scientists'.
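To illustrate what "parallelized under the hood" means, here is a toy, single-process simulation of the pattern behind a distributed reduction. The loop over chunks stands in for separate MPI ranks; in Heat itself the combination step is a real MPI allreduce done via mpi4py, not a Python loop.

```python
import numpy as np

# Toy simulation of a distributed sum over a split array: each "rank"
# works only on its local slice, then the partial results are combined.
global_data = np.arange(12, dtype=float)       # the logical, monolithic array
local_chunks = np.array_split(global_data, 3)  # 3 simulated processes

partial_sums = [chunk.sum() for chunk in local_chunks]  # process-local work
global_sum = sum(partial_sums)                          # the "allreduce" step

# Every rank ends up holding the same global result, as if the array
# had never been split at all:
assert global_sum == global_data.sum()  # 66.0
```

The point of the library is that the user never writes this communication step: calling a Heat reduction on a DNDarray produces the global result directly.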
Also, and this is quite anecdotal, but my experience, at least in my former community, is that scientists who are really highly skilled at software development will still shy away from scientific development roles, because they feel they might jeopardize their scientific career. So I guess I'm coming back to job safety. Scientific developers need job safety like everybody else, I would say, but this is really a category of scientific generalists that has it really tough in a specialist-dominated research world; and unless they switch out of academia really early, they might also have it quite tough in the software development industry, because they were not trained as software developers or software engineers. Okay, so anyway. Within the Helmholtz Analytics Framework we ended up with a pretty heterogeneous group of scientists, with different levels of experience and really different kinds of expertise. In the team we have one computer scientist, two informaticians, one neuroscientist and five physicists, but the five physicists, including myself, are from the most disparate branches of physics, so we are really not the same species, I think. I forgot to ask my team for permission to use their pictures before I left on summer break, so I hope you can tolerate the cat emojis replacing our pictures; that's the German part of my brain playing it safe with privacy. These are the core developers, plus a few brilliant minds who have contributed less code, maybe, but many ideas, especially at the beginning of the project. I want to mention Markus Götz specifically. He's at the Karlsruhe Institute of Technology, and he's been leading the Heat development pretty much from the beginning, basically just out of his PhD. Really, I cannot speak highly enough of this talented young leader; if you get the chance to work with him, take it. I also want to mention Daniel.
Daniel Coquelin, who started out in Jülich, has now moved to Karlsruhe as well. He's certainly our most prolific developer; he almost single-handedly hammered out the linear algebra module and the data-parallel neural networks module, among many other things. Then I want to mention Charlie Debus, in my opinion one of the most brilliant minds on the team; unfortunately, she's one of the most brilliant minds of many teams, so the time she can devote to Heat is less and less. But you can really thank her for a lot of the parallel machine learning implementation within Heat and for a lot of the communication layer, the really tough, abstract MPI stuff. Then I want to mention Philipp Knechtges and Kai Krajsek. They have contributed maybe less code, but the bigger ideas; among other things, they have been working on the MPI4Torch library, which I forgot to write here but you can look up, and which will eventually be merged into Heat and allow for distributed automatic differentiation. That's a really big thing. And then we have our continuous integration gods, Björn Hagemeier and Michael Tarnawa. Björn has also taken on a lot of the coordinating between the developers and the use cases, well, most of the coordinating, I would say. Okay, so what I want to mention here is how well the team got through the initial calibration, or recalibration, of the group dynamics; I think it was unavoidable. It's a group in which seniority and experience don't necessarily match. For me, as a senior staff scientist, it wasn't that obvious that it would work so well. We were bound to have culture clashes, coming from so many different communities, but really, for me, coming in as basically a senior beginner in parallel computing, and you can tell from our code base it's not just me.
I mean, most of us aren't professional developers, but no matter: this is really a group in which, from the beginning, it always felt okay to make mistakes, even low-level mistakes, basic standard-library Python questions. When you mix people from so many different backgrounds, that's really a good environment to have. And I have to say that we, at least the scientists on the team, also understand users' needs in a way that a team fully staffed with professional developers probably couldn't. Of course, they would produce the library in one third of the time, but okay. Well, while preparing this talk I had almost forgotten that roughly one third of this project was run under pandemic conditions. If I look back at all the progress we made in the last year, it almost looks like, for us as a distributed remote team, it was sort of business as usual, at least in the beginning. In the last year and a half, our paper was accepted at IEEE Big Data 2020, which was a really big thing for us. We went through a major code overhaul. We kept implementing new features to support the use cases. We kept submitting papers and presenting Heat at conferences. We wrapped up the Helmholtz Analytics Framework this spring, quite successfully, I think. Eventually, quite surprisingly, we were approached by Intel developers, and that started a collaboration with them. I could go on. There were also a few things that didn't work so well: we kind of lost touch a bit with some of the use cases' needs, and we were too slow to respond to inquiries from new users. But overall I think we had a very productive year and a half, and of course, as I said, it helped a lot that the Heat developers started out as a distributed team that collaborates asynchronously anyway and shies away from unnecessary meetings. We are totally meeting-lazy. But I don't want to brush the strain of pandemic operations under the rug.
In fact, I tried to put together a slide that would effectively convey this strain of the last year and a half, and I think I failed. This slide does give me a bit of anxiety, but not as much as we really had in the last year. We went through the whole spectrum, from loneliness to being overcrowded. I have three kids, everybody on a video call at the same time, endless interruptions; and some of us were worrying about the next career step, and so on. It was really quite a strain that, as I said, I don't want to brush off so easily. I should get my kids Hawaiian shirts; I just realized that yesterday while I looked for the where-are-my-pants Lego guy. Anyway, what worked, sometimes, I think, was the fact that some of us started out pretty clueless about what we were doing and the landscape in which we were operating. I think that cluelessness was actually one of the reasons why we kept going. Speaking for myself, I didn't stop to wonder: can we really do something different, or even better, than Dask and CuPy and so on? I was learning all the time, learning new things, and like everybody else on the team I was solving difficult, abstract problems that, in the beginning at least, we weren't used to. Each of us could find something low-boredom, high-safety to do, because the skills on the team were so heterogeneous. So I like to think that we had very forward-looking management in place when the Heat developers team was put together. I also kind of know that a lot of it was just luck, or let's say enthusiasm on our side. Anyway, we ended up with a toolkit that at least some of our use cases are actively using for their applications, and that allows them to carry out data analyses they just weren't able to do before. That's especially true for those use cases that were lucky enough to be able to hire dedicated developers for this project.
For example, I want to mention the terrestrial systems monitoring use case, the earth system modeling I mentioned earlier, which hired Daniel Coquelin at the beginning of the project. They hired Daniel first, and he's probably contributed the most lines of code, and a lot of features, to the library. Later they hired Ben Bourgart, currently our main source of bug reports. So thank you so much, Ben. A similar fate befell the second neuroscience use case, whom I'm supporting in porting their NumPy tool to Heat; they can now run some statistics that they just couldn't do before. So finally, well, almost finally, I have two finalists here, I think. I want to mention, from my point of view, the greatest lucky turn, even if it sounds funny to say it like that. Early on in the project, one of us, probably the most meeting-lazy, Philipp, suggested replacing our bi-weekly video call with text meetings on our Mattermost chat. I just cannot overstate the impact that text meetings had on my onboarding, and even later: being able to go back and search discussions that were going on during the meetings but that I just wasn't able to follow in real time. That was just awesome. Even to this day, I go back to discussions that happened two years ago, because I remember there was something, and now I understand it, whereas two years ago I maybe didn't quite understand all of it. It also saves us a lot of time with students: they join the developer team for a limited time, but they join all the time and then they leave, and they have access to all the information, the previous discussions, the links. Really, I think any group having regular meetings and a heterogeneous skill set should consider switching to text meetings. Again, this is probably a known fact if you are in software development, but in academia it's not that obvious.
Finally, I already mentioned it earlier, but one major lucky turn was being approached by Intel for a possible collaboration around their oneAPI. This happened right at the time when the Helmholtz Analytics Framework funding was about to run out; as for further Heat development, we knew it was going to go on, and we knew why, we just weren't quite sure how. Of course, having an industry leader interested in what we do makes it easier to secure more public funding. I'm not really into the tiniest details of the collaboration, but they are openly contributing to our repository, so you can have a look there. Okay, so now: room for improvement. This is more of an if-I-could-do-it-all-over-again thought experiment. From the development point of view, I think the main thing that probably slowed us down, and that we could have handled better even within our boundary conditions, is that the roles within the team were a bit fuzzy. I think that was justified in the beginning, because we all had to find our place, but maybe they wouldn't need to be so fuzzy now. To be sure, a lot gets done, but it's mostly on a volunteering basis. And this tends to crystallize, of course, in the sense that when developers take on a task, maybe the implementation of a feature, or fixing a bug, they become the implied default person for that piece of code, for that task. Obviously the workload distribution gets really skewed really fast, and people end up specializing; we don't want that. Of course, if you have a software engineering background, by now you're probably thinking: oh yeah, these are the basics, right? We learned that in the first semester. This kind of insight is what Italians call discovering hot water, which is totally not the same as reinventing the wheel.
So if you need a nice new idiom, keep that in mind. However, I think one of the main features of data science projects within research is in fact that they are somewhat depleted of software engineering skills. So for those of you who at some point will find yourselves putting together a bunch of scientists to produce software: be prepared, they might be absolutely, vehemently against Scrum, but at least establishing some kind of rota from the beginning, especially for the more tedious tasks, might be a good idea. It's not just about distributing the workload, of course; especially in small teams, you want everybody to be able to do almost everything. And of course, boredom needs to be distributed. You see I have a boredom theme here, and I also found myself a nice little niche role within the team for that. Okay, within the entire framework, again, if I could do it all over again, the one thing I would definitely change, and where I think we really could have done better, is that the developers should have been more proactive in supporting students and early-stage domain scientists. We kind of assumed that they would get in touch with us if they had questions and problems, but the hurdle of creating GitHub issues and then keeping pulling at our sleeve to get their issues solved was just too high at that stage of a scientist's career. It was a low-safety, high-boredom activity, so in the end it didn't happen, not as much as we expected. If I could do it again, or maybe in the next phase of the project, I would certainly set up some kind of dedicated support channel for students and early-career scientists, and certainly keep more of an eye on the students' timelines, because they really have tight deadlines. Okay. Ah, yes, I think I'm almost done. So: known knowns, known unknowns and unknown unknowns. I think I've mentioned our known knowns already.
The ongoing work is on parallel singular value decomposition and more complex machine learning algorithms than we have now: SVD, principal component analysis, and optimization of data-parallel neural network training. There's ongoing work, as I mentioned, on distributed automatic differentiation. And we have the just-started Intel collaboration, which we hope will continue. Then, the known unknowns. These are projects that we are trying to get started, and they mostly deal with dissemination, Heat dissemination beyond the Helmholtz Analytics Framework use cases. Of course, we would like other scientists to try the library as well, and to use it if it helps them. One group we have been in touch with recently is our local earth system data exploration group in Jülich, that's Martin Schultz and Felix Kleinert. Then we have a possible, kind of up-in-the-air collaboration with the universe-and-matter big data, big science projects; that's a German call for proposals, a pretty big thing, where we have started to work. And then of course my fixed idea, and that's literally up in the clouds and above: maybe at some point, eventually, to send Heat to space. And the unknown unknowns? Who knows: your use case? If you have something interesting that you want us to work on, or if we can help you in any way, get in touch; GitHub would be the best place for that. And I would like to thank, well, my management first of all: Daniel Mallmann and Björn Hagemeier from the Federated Systems and Data division at the Jülich Supercomputing Centre. Daniel is the division head and Björn is a team lead. They hired me in 2017, they gave me safety, they gave me time to figure things out, to learn at least part of the things I needed to learn; I don't think I'm done with that yet. And they gave me hard stuff to do.
They really put me in that nice green square where I wanted to be. And Markus Götz also tolerated quite a bit of my cluelessness, especially in the beginning, very gracefully, very graciously. Then I would like to thank our student contributors: Lena, Ben, the two Simons, Jakob, Luca and Fabriz, and I hope I'm not forgetting anybody; there were certainly more who came and went. Students are really the better coders in this project, so I really enormously enjoy working directly with them, because I learn so much, and of course they also learn a bit of the scientific perspective. Also, I'm not sure I could have switched to data science so late in my career, in my 40s, without Coursera, and without the fantastic Python 3 Deep Dive by Fred Baptiste. Another great resource for understanding parallel computing and the Message Passing Interface is the EPCC online class that I've linked here. I suppose I will send in the slides later, so you will have the links there if you're interested. And I think that's it. Thank you so much. Thank you very much, Claudia. That was a very inspiring talk, and thank you for all the insights you gave into how scientific projects are run. I think that's an area that not many of us can actually relate to, because most of us are probably from the software development industry, and there, of course, things work a little differently. Also, I think that most of us probably work in smaller projects, not these big ones spread between institutes across Europe. So that's very interesting. We have lots of questions. I'm going to ask just a few of them; I'm going to post the rest in the breakout room, and I would like you to join the breakout room after the session. So, one question that got a lot of feedback was this one: you mentioned text meetings. It's a new concept for me. What does it actually consist of?
Well, that's surprising, because I thought it would be kind of common, especially in software engineering and software development. We have a Mattermost; it's a chat, right? And we have different channels where we discuss. We are in touch pretty much all the time, but we have a dedicated channel that we call, optimistically, sprint meetings; our attempt at Scrum crashed really fast, very much in the beginning, but anyway, the channel name has stayed. Every two weeks, at a given time, we meet in this specific channel, we all come together and discuss things that are going on. I especially like that we talk about actual issues, not so much about "this is what I've done"; I don't care about that, it's on GitHub, right? What we care about is what problems we are running into, where we think we need help or are getting stuck; distributing issues, which is what the boredom master usually does, distributing issues that have been there for a long time and that nobody takes on; or planning the next release, this kind of stuff. So we all join this specific channel, we try to all be there at the same time for maybe half an hour, an hour, and it works really well, because the discussion is there to read, for people who can't attend, or if you want to go back to it later. Right, I can very much relate to that; I'm very much on your side regarding these text meetings, because I'm having too many in-person meetings, or nowadays virtual meetings, and they take so much time. Absolutely. The next question is this one: do you have some ideas on how best to retain technical people in science and give them a career path and recognition? You mentioned some of that already in the talk, but perhaps you can go a bit more into detail.
Well, my main idea would be: universities, cough up those permanent positions. Because you don't get publications if you're working on software, and nobody even knows you exist, because they use your software but they might never have met you. It's really tough, right? So I think it needs to be more front and center in group leaders' and professors' minds that these people need more safety, more job safety. Unfortunately, I don't have solutions, but science is a big field. In fundamental research it's quite difficult; especially in Germany, you don't have these middle-level permanent positions: you either become a professor or you're on three-year contracts all your life, basically. In applied science, for example, I'm totally grateful and lucky that my position was turned permanent this year, but that's because in applied science, where I'm working, there is real competition with industry; we just leave after a while if we don't feel safe. Yeah. Okay, so another question; I'm just going to do one more, a very quick one. Someone was asking where the job offers are listed. Obviously, we have people interested in maybe JSC and the Helmholtz Association. You know what I'm going to do? I'm going to post the links in the breakout room; I think that's a better solution. Right. So, to wrap up, thank you very much again for the talk. That was very interesting. Thank you so much; I hope you're going to try the library. Yes, of course. I didn't put up any questions about the library, but there are questions about the library; I'm going to copy all the questions I have noted into the breakout room chat, so you can then discuss them there. So thanks again. Okay, thank you so much. It was great to be here. Thanks, bye-bye. Bye.