All right, I guess we can start. Welcome, everybody. My name is Alexy Khrabrov, and I'm the open source science director at IBM Research, on our Accelerated Discovery team. My colleague Tim Bonnemann is the community lead, and he is going to co-present with me. Some of you may have seen us at the Open Source Science booth. I will explain what Open Source Science is and how it relates to IBM and NumFOCUS. The map of science is a project of Open Source Science. It's an initiative and a working group, and I'm going to talk about the ideas, the data, and the goals of this project. Hopefully some of you can join us, or send some folks who can benefit from it.

Open Source Science is an initiative that is IBM's open source strategy for science. I joined IBM Research three years ago, first as the technical lead for quantum ecosystem development, and then we started a new team called Accelerated Discovery, which basically solves hard problems of humanity with science: materials discovery, carbon capture materials for fighting climate change, new drugs to fight cancer and other diseases, climate modeling for sustainability research. These are really hard problems. On the one hand they involve deep science: you need to model climate data, you need to model chemical materials, and obviously there is a lot of work for machine learning, AI, and now foundation models. On the other hand, you need to use a lot of software and a lot of tools, and it turns out that scientists approach software differently than professional software engineers do.

I wonder who here comes from a scientific background, or had a substantial scientific background at some point? I know Harold has, and some others. OK. And who likes science, who cares about solving interesting problems in science? So I guess there is motivation on both sides of this, because scientists really need
open source tools. Scientists usually do not have much money, and scientists work collaboratively, so scientists are wired by their DNA to use open source. On the other hand, there is a difference between scientists and traditional open source developers, the kind you often find in the Bay Area. I have spent time on both grounds: I've been in startups, I've been at Amazon, and I've been at scientific labs.

Developers generally have the mechanisms of information exchange wired into them. You know which blogs to read, which tweets to find, who to follow, who in your area is an authority. A conference like this is our professional tool for dissemination of information. You come to a summit like this, and it is divided by tracks, divided by area. You like JavaScript, you go to the OpenJS summit; you like machine learning, you come here. You generally know where to find stuff and where to present stuff.

Scientists do this for their subject areas. They have conferences for physics, conferences for chemistry, for plasma physics, and so forth. The American Chemical Society holds a massive congress in the same place where Databricks holds its summit. We attend both, and they're extremely different. If you put up an Open Source Science sign at the Databricks conference, you'll see a lot of people. If you do it at the Chemical Society, they'll take all your swag and not necessarily ask you anything, because they are looking for chemistry, not necessarily software yet.

So what we need to do is connect scientists with open source developers in an efficient fashion. Our strategy is basically community-centric. We created Open Source Science at NumFOCUS. IBM has been a general sponsor of NumFOCUS for many years, so we inherited this relationship and
upgraded it. Who knows about NumFOCUS? Let's put it this way: who knows about Jupyter, pandas, NumPy, SciPy, any of these? So all of you generally know the projects that sit under NumFOCUS, and nobody knows that they are under NumFOCUS. It's a nonprofit set up by Travis Oliphant to host NumPy. By some metrics NumPy is the most installed open source project of all time, and it's the foundation of any data science library in Python. Over the years NumFOCUS has grown to about 80 different projects, including scikit-learn and pandas, which are huge and used everywhere. But NumFOCUS is small compared to the Linux Foundation, and people mostly know its projects rather than NumFOCUS itself.

So we have this initiative. It is a top-level group spanning basically all activities. First, we need to understand the communities of scientists who use open source software. The goal is to accelerate science with software, not to produce more software with more software, which is what software people usually do. If you have two pieces of software, there will be a third piece of software to connect the two. So we have a self-perpetuating, virtuous ecosystem. But scientists generally don't care about software. If all the software disappeared and cancer were solved, cancer scientists would probably take the deal. For us that is very hard to comprehend, because we live in software and we enjoy software. To see somebody whose end goal is not better software but something else is new.

Scientists are very passionate about their scientific research, so they will use whatever tools are necessary. They will use paid Windows lab software because that's what they have, and if they cannot use it, they will usually not try to break into it. For that you need their colleagues, the research software engineers (it's a new, emerging term), who
basically care about science but know software, know hardware, know data pipelines. So we need to find these people, and we go after them in the same fashion that we build software engineering communities like this one. You find people who are passionate about a domain. If you want to do a conference on big data, you find people who are good at Spark, and Hadoop before that. You find interesting topics, usually things they do in their companies, you invite them to share, and they come and share, because people like to share best practices.

If you look for open source science people, you find scientists who are good at software. This is a very tricky situation, because they need to know enough about the domain to have a bird's-eye view and do really impactful things. It's easy to find graduate students focused on their own little niche. It's harder to find somebody who makes choices and drives impactful software. So far, what we're doing here started at SciPy in Austin, which is now a major NumFOCUS conference. I think this is the trick: this is very much social engineering.

Once we have the communities, meaning interest groups by vertical, and we have found the right people, we want to find out what the agenda is: what software can move this field forward. We identify gaps and needs. Some research needs features in existing software projects, and some projects don't exist and have to be created. For example, LabVIEW dominates lab automation. It's a Windows program which uses an RS-232 port. It's hard to find people who understand how to replace it with an open source piece of software, and most scientists are not interested in this. So a lot of lab automation, even by the top nerds in lab automation, apparently works like this: they use LabVIEW
and they extract weird data into an Excel spreadsheet. This is the top of automation: now the data is liberated, it can be put into a CSV file, and it can be put through some motions. This is the level. A lot of stuff sits in completely proprietary equipment. Chemspeed makes these amazing AI-driven robots, but it's all closed source. They have some open source pieces, but that's not how they think. It's not the main goal.

So once we have the communities of open source developers and scientists who care about specific verticals, and we have identified some needs and gaps, in the next phase we can direct resources to them. There is now a huge amount of grants and a huge number of initiatives in Europe and in the US: NSF, OECD, Horizon 2020. All these initiatives, at the EU level, government level, and regional level, direct funding to open source science.

Science has an advantage over industrial open source: science is generally an expense. Nobody expects to make money in science. Science is government spending in general, or nonprofit and charitable spending, and most science is financed by governments. For instance, NASA is ahead of other government agencies in cloud and open source. The reason is that nobody will sell you galaxy data. It's free, and you will usually not make money out of it. But it's also a vast amount of data, so they had to figure out how to handle and process this data on a government budget. The government pays for equipment, so they approach the cloud and they create workflows. They don't necessarily arrive at the things people use in Silicon Valley, because very often they're not aware of them, and they have no incentive to come to meetups in Silicon Valley. So it's very interesting that information in science usually spreads through the verticals, so whatever chemists use
now is what other chemists will use. What's even more interesting is the scientific Python ecosystem. Scientific Python is another top-level project of NumFOCUS, and it has been going on for many years. Jarrod Millman is one of the co-founders, and these folks basically created Jupyter. A lot of the great foundational projects we use now are due to the Scientific Python community.

The Scientific Python community uses Dask for concurrency, and in Silicon Valley we have Ray for training LLMs. When the Ray folks present at SciPy, they literally cannot find users yet, because Dask emerged through the grapevine as the way for SciPy people to do concurrency. So you clearly see that in this case it's just not a technical choice. It's a question of who knows whom, and the people who did Dask have operated in the community for a long time. Same thing with Jupyter: the people who do JupyterHub come from the community. So the tools are self-selecting, which is unusual. In industry you have tight competition and benchmarking. That's not necessarily the case in science.

These are the interest groups we have so far. We have groups on chemistry, materials science, healthcare and life sciences, and climate and sustainability. They reflect IBM's priorities in research, and everybody else is welcome to stand up their own groups. We currently have folks who want to do more on space. Then we have horizontal groups which span technologies. One is reproducible science. Everybody cares about reproducibility, and everybody means a different thing by it. That's something we are very active in. We sponsored the first ACM conference on reproducibility, which was held in June at the University of California, Santa Cruz, and we have an interest group on this. Another interest group is the map of science. The map of science
is an effort to map all of the open source used for science, and it will be the next part of the talk. I'll just briefly mention our partners. There is NumFOCUS, and there are the OSPOs around the world. If you have an OSPO, we usually partner with OSPOs. IBM works with a lot of companies directly, so we can bring that in too. This group is called the Future of Research Software. These are representatives of innovation ministries and government agencies around the world, who met in Amsterdam last year and are now meeting in Montreal. They're drafting the Amsterdam Declaration on sustainable research software. The idea is that research software should be built with a community, it should be findable, and it should be built according to the best standards, and that way it will persist. We find that this overlaps extremely well with our mission. Once we map where the software is and which software actually drives science, it brings visibility to that software and hopefully brings the resources to make it sustainable. And this is a bunch of organizations supporting us.

So what is the map of science? When we started, we understood that we want to get to a future where science is driven by open source, where the open source mentality drives science. The open source mentality means people are rewarded on merit. If you contribute code that everybody recognizes as useful, you get a lot of acceptance in the open source project. If your open source project is used by multiple organizations, there are clear ways to see validation and merit. That is often harder to get in traditional science, because a lot of it is a social trust network, a bit like a church. It's a hierarchical system which evolved over the centuries. A lot of knowledge is kept by senior scientists, and junior scientists apprentice and earn
the trust of the community, which has its own important merit. But these two systems are now bound to collide. Inevitably, open source will disrupt science. It will not supersede it, but the merit system which comes from open source will more and more interweave with scientific recognition. Right now a lot of junior scientists who do software complain that their software is not recognized as a contribution as much as papers are. But that will change. You can now publish software in the Journal of Open Source Software and have it cited, and hopefully that will count more and more as results.

So what are the objects in the map of science? It's hard to drive this by product market research, as startups do. Usually, if you have Uber for pets, you have an idea, you can hire a product manager, and they will interview a bunch of pet owners about what they need from an Uber that will drive their pets around. Here there are multiple incentives and multiple views, so we need to hypothesize. We need to do it in a Steve Jobs fashion: put it out there and get users to it.

The entities we have are papers and people; that's traditional science. Then we want to know the teams and labs where these people live, because labs usually have a theme, and their organizations. It matters where organizations are, because the funding and the priorities are usually dictated by their national affiliation and national agenda. Different universities are also famous for different things. Take schools of economics: if you want to study a specific kind of economics, you go to a given university. The University of Chicago does it one way, the University of Pennsylvania another way. So different universities are known as sites of excellence in different sciences. If you do materials science, you go to the University of Chicago, the University of Illinois, and so forth.
So we need to know these organizations. Then, of course, there are OSS projects. Papers have now started to cite OSS projects, but it's not uniform, and there is no standard way to do it. People are starting to use Zenodo DOIs, but there is no widespread practice. Most often you will get a URL. Sometimes you need to normalize the URL, and sometimes the URLs are duplicated, because repositories can be forked. So there is another layer: you need to understand what the projects are and whether you can identify them.

Then sometimes you have initiatives. Some universities have whole labs. Berkeley has the RISELab, which produced Ray, but it also produces other things. Previously they had the AMPLab, which produced Spark, Mesos, and Tachyon (now Alluxio), so three startups came out of the AMPLab. You need to know the overarching agenda of a lab.

And then of course there are grants. In academia everybody spends their time writing grants. If you are an administrator, you want to know which department is most effective at attracting grants for research. If you can show that a lot of that grant money is spent on software, you'll get unequivocal buy-in from the administration. They may not care about your priorities, but if they see that you are attracting funding very successfully, that you're funding students, and that you are actually funding open source at your university, they will pay attention.

This is what the University of California, Santa Cruz has done. They looked at how the university commercializes intellectual property. There is an IP transfer office in every university, and you can now attribute that income to the open source used in the grant that produced the IP. This is a very innovative way to come to the administration, with a table showing that millions of dollars were received thanks to open source. That way you'll always have
the attention of any kind of administration. This is a very important thing, and we'd like to be able to do it.

So the users are, as I just described, administrators and supervisors who drive activities in science. Then there are the scientists themselves, who need to discover the open source used in their area if they're not yet fully familiar with it. Maybe their small group is isolated, maybe they are just starting, maybe they are not yet users of open source or of software in general. And then there are developers. There are professional developers who are tired of writing ads. As David Patterson famously said, the best minds of our generation are put to work following clicks online; is this a good use of our time? He called in 2015 for developers to fight cancer, and some developers did join biotech labs. Very few, of course, but for some people it's a meaningful activity, and they're looking for it, maybe as a part-time or hobby activity. They want to make an impact on scientific research. Open source developers are motivated by passion, so this is a very good way to let them apply their passion to something meaningful.

And now Tim will talk a little bit about a prototype of the map of science that we built recently. Do you want me to go to the browser? OK, all right, let's see. I think I need to first get out of here, and then I can go here. I can probably maximize that. All right, great.

Hi, everyone. I recently had a chance to go to Greece for a conference that brought together 300 computational chemists, and I just started asking them: do you use open source in your work, and what kind of tools? It was very interesting. So we started to create a bit of a demo that we can use at events like this and at the booth to get the conversation started. We're using
a tool called Kumu, and basically we just plug in these tools. This is the first iteration. You can see here, maybe dimly, these orange dots. These are open source tools that are used in computational chemistry. Then I went ahead and pulled the list of contributors from GitHub and GitLab. After just a few more tools are added, you can see people emerging who contribute to more than one tool: these dots in between the little bubbles. Fairly quickly you get this view. This is 17 tools, and I can enlarge it here. You can see there are people who contribute to more than one project, some even to two or three.

That is not in itself an earth-shattering insight. But if you look in the bottom corner, those are actually two papers, and here we see four authors who have published both of these papers. What you can't see yet if you zoom in, because we're using the real names from arXiv and the usernames from GitHub, but what I have confirmed manually, is that at least one of the authors is also a contributor to this project.

Now I need you to use your imagination. Imagine a map that has hundreds and hundreds of open source tools in a domain, and maybe tens of thousands of research papers, with more fidelity about the people: who is the main contributor, who is just an occasional contributor. Maybe you can even double-click on some of these people and find out more about them, see which communities they might be involved in, or where they hang out online and discuss their science. Hopefully it will, on the one hand, improve discoverability of existing tools, so you don't have to reinvent the wheel all the time. It might also encourage people to connect earlier in the scientific process with
the folks who are already in their little area of interest. I hope that makes sense as a concept. Of course we have a lot more work to do to flesh this out, but it definitely helped us get feedback from these scientists as they imagined how they would use such a tool. We could discuss some of the use cases, some feature requests, and some questions around tools that aren't strictly open source but are free for academic use, and so on. So it's a very helpful exercise to get this going. We are currently finishing up the high-level concept for this map, and we'll move it forward from there.

Right. Actually, I'm going back to the slides. Were there more slides? I think so. We have 15 minutes left, so let's use them for a discussion. You are all OSS or JS people, so you understand how this works, and a lot of you have scientific interests.

So why do we need this? When we tell folks that we want to build this, the reason is very simple. If you want to go somewhere, you need a map. You cannot find out how to get from A to B, you cannot draw a route, unless you know where A is and where B is. Where B is, we know: we want to be in the beautiful world where science is permeated by software and everybody is a developer. LLMs can give that to us in some way, but not immediately and not meaningfully yet. So we need to build this map. The exact details differ, but people generally agree this is a great idea. This is not a research project, this is not a PDF, and this is not something we want to do once. It should be a living and useful portal.

Initially, when we started, we were very ambitious. We thought we would need to build a portal for science. There is no social network for science. If you think about how scientists exchange information, they generally go to LinkedIn and
they go to Twitter. Twitter is a machine for URL exchange. Professionals generally share URLs on Twitter. That's how you find software, and that's how you find papers. Something like 60% of tweets between scientists contain a URL. We thought of things like a scientific bundle: if we build this map, we can let people put together papers, ideas, and code, maybe send it to other people and say, hey, what do you think of applying this algorithm to this data? Maybe data sets too; data sets should be indexed as well, of course. So that's the general idea.

We're going to have the first workshop, funded by the Chan Zuckerberg Initiative, at their headquarters. They selected 50 top researchers in bibliometrics, meta-research, and open source for science, and it is going to be run in Redwood City at the headquarters. It is by invitation only. If you want to be invited to the next one, let us know. It's going to be a hackathon, so we're going to build a chunk of this. We're applying for grants with nonprofit foundations to get developers to do this, because it's a build-intensive project.

These are our communication channels; I'll just leave our URL here. If you go to this URL, which is our name and also our URL, you can sign up for the newsletter. I wanted to leave it here and ask you: what do you think? If you have questions, suggestions, or feedback, let's use the remaining time for Q&A and discussion, because we really want to know the opinion of this group. Don't be shy. Any kind of question is good, whether you didn't understand something or you object to something. OK, we have a question. Good.

Hello, thanks for the talk, that was great. It sounds like you need to motivate the scientists to actively do something to enhance this map of science. How do you plan, or how would you like,
to motivate the people who are already swamped with paper deadlines and other work?

Excellent question. Do you work with scientists? Because it's very deep; you get at the heart of the problem.

I've been a scientist, but I have switched to industry by now.

Very good, yes, that shines through. Thank you for that. This is very typical. This summer we did Open Source Science on Rails: we went from the Chicago open source science symposium to SciPy in Austin, and on the way we stopped by St. Louis University. The incoming dean of engineering there used to run the Open Science Framework, which was basically this kind of software, a replacement for lab notebooks. It requires scientists to go back and file their research, and of course what happens is that they stop doing it. Scientists have enough to do without filling out forms. In general it's a thankless task to ask people to classify themselves or come up with ontologies. Ontologies are the end of it. Once you start to argue about ontologies, things are done. So we want to avoid this very carefully.

We thought of different technical approaches. First of all, at IBM we have an approach called Deep Search, and many others like it exist. You can now basically slurp the whole of arXiv into a search engine, and then you can brute-force extract mentions of URLs. Then you can use classification to understand what the paper is about; it already has some classification to begin with. So you can pretty reliably extract URL links. Search engines and scientific metasearch engines have been doing this since 1997. I was in the group which built the original CiteSeer, which was a metasearch engine for scientific literature. So we could go bottom-up and parse the internet. There are companies like Sourcegraph, and, I think,
a company that is a sponsor here, which, given a code base, can work out the dependencies. Obviously you can piggyback on the software supply chain work. So you can say: give me the links to software from papers, extracted automatically, then go parse the dependency graph, give me the dependencies, and classify them. This is a legitimate direction.

The problem is that GitHub is littered with dead repositories which are clones. Scientists are the last people to think of carefully maintaining and deduplicating repositories, and students use them for class projects. So the source of truth is hard to find, and we realized very quickly that we do need some seeding. We need these interest groups, which we have as the social backbone of this initiative, to do something of this nature.

A very simple ask is to ask a scientist: give me five tools you or your colleagues are using. Let's ask them just that, and let's not overburden them. We're not chemists. Until we started to look at chemistry, we didn't know about RDKit, and RDKit turns out to be the number one package used by computational chemists, to the point that Syngenta, an agrochemical company, now has a full-time person supporting RDKit, because it's used by its computational chemists. This is the kind of thing we want. We want the open source which is judged to be critical by practitioners.

Obviously a few people cannot find all of this. You need to delegate it to people in the disciplines. So our job is to find the right folks, who can then ask the right questions. And it's not just the two of us, but the respective chairs of the interest groups. So first of all, we believe we do need a good working organization, because eventually, whatever you find, humans need to review it, so we
need to work on this.

One approach is obviously the map itself: people want to be on the map. ResearchGate does this. If you have published any papers anywhere, you're receiving spam from Academia.edu and ResearchGate which says: hey, John Smith, is it you who wrote this paper, cited by John Doe at this conference? And of course you're now very curious what they said about you. Then you go there, and it says pay eight bucks a month and we'll tell you. This is obviously a growth hack. We will not charge eight bucks, so hopefully people will come.

But there are many ideas, and we're open to ideas. We think we'll have champions of open source in science whose passion is enough to give us some input top-down, and we can then use data mining to link it from the bottom up. We're very open to ideas. If you know good ways to motivate scientists, connect with us and let us know.

Other questions, or any ideas or feedback? Does it make sense? Would you use this map, for instance, to help a science project? So we need to motivate open source developers too.

All right, so that's basically our context. If you have ideas (as they say, good ideas come on the way out), if you have feedback, if you're interested, if you have colleagues in science and open source who care about science, send them here. We are set up to welcome people. We had our table; we're going to be wrapping up, but we had our booth, and quite a few people came to us and learned about us. Thank you for doing that. We'll hopefully be at many of the Linux Foundation open source summits. We also go to PyData conferences, which are run by NumFOCUS. So if you see a PyData meetup in your area, come say hi, and send people. We're very tight with the PyData network, which is now all around the world: there is PyData in New York, there is PyData Global, and we just held PyData Amsterdam. So that's a very fun
place for open source science and data science. Thank you very much.
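
A note on the bottom-up approach described in the talk, for readers who want something concrete. The talk mentions extracting repository URLs from papers, normalizing them, and then spotting people who contribute to more than one tool. The following is only a minimal sketch of those two steps, not the project's actual pipeline. The function names (`normalize_repo_url`, `extract_repos`, `shared_contributors`), the choice of hosts, and the sample contributor names are all assumptions made for illustration. Forks under a different owner are not merged here, and GitLab subgroups are truncated to the first two path segments, so a real system would need more careful deduplication plus the human review the speakers call for.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hosts treated as code repositories (an assumption; extend as needed).
REPO_HOSTS = {"github.com", "gitlab.com"}

# Rough URL matcher for running text; stops at whitespace and closing brackets.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def normalize_repo_url(url):
    """Return a canonical 'host/owner/repo' key, or None if not a repo URL."""
    parsed = urlparse(url.strip().rstrip(".,;"))
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in REPO_HOSTS:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None  # a user or organization page, not a repository
    owner, repo = parts[0].lower(), parts[1].lower()
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{host}/{owner}/{repo}"


def extract_repos(text):
    """Collect the distinct normalized repositories mentioned in a paper's text."""
    repos = set()
    for url in URL_RE.findall(text):
        key = normalize_repo_url(url)
        if key:
            repos.add(key)
    return repos


def shared_contributors(contributors_by_tool):
    """Map each person to their tools, keeping only people on more than one tool."""
    tools_by_person = defaultdict(set)
    for tool, people in contributors_by_tool.items():
        for person in people:
            tools_by_person[person].add(tool)
    return {p: tools for p, tools in tools_by_person.items() if len(tools) > 1}
```

For example, a paper that mentions both `https://github.com/rdkit/rdkit.git` and `http://www.GitHub.com/RDKit/rdkit/tree/master` would yield a single project key, and `shared_contributors` would surface the "dots in between the bubbles" from the Kumu demo when given a tool-to-contributors mapping pulled from GitHub or GitLab.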