So I would like to thank you all for waking up this morning and coming to my talk. On the walk over here I almost wanted to skip my own talk because it's so wonderful outside. So thank you for coming here. I really appreciate it. And I would like to give a special thanks to the organizing committee. Thank you for inviting me. It's such a privilege to be on stage today and to share my journey into open source, and also working in data science and some of the projects that are tangential to both of those.

So I have two questions to pose to the audience today. There'll be more, but these are the two that I'm going to start with at least. What does it mean to be human-centered? And how do we use data to empower, build, and grow? So I'll just let that percolate for like 30 seconds. These are actually the questions that I use to anchor the talk that I'll be giving to you today.

Okay, 30 seconds is long. That was uncomfortable. Okay, so when I was first thinking about this talk, I went through many iterations. One of the iterations was that I was going to give a journey through my PhD experience using allegories to the Harry Potter books, and they'd basically all be the later books, like book four through the end, because that's where things get dark, and that was pretty much from the beginning. But I wanted to spare everyone that cathartic moment. I shouldn't use the stage just to vent on that. What I would like to do is talk about some friction points in science when it comes to doing analyses, using tools, and being able to share your work. And I will walk you through some of the projects I've been involved with to help reduce that friction and at least make it easier for scientists to get involved, to contribute, and to form communities around it.

But before I do that, I would like to ask you another question. I told you there would be more than two. Is anyone familiar with the book that's on the screen?
Great. So this is a human-centered data product. This product is called The Negro Motorist Green Book, also referred to as the Green Book, and it was a way for Black travelers to navigate through Jim Crow in the US. It offered recommendations for restaurants where they would feel welcome, places where they could get their hair cut or their hair done, or their hair did. It offered all of these opportunities in spaces where they would feel welcome, and this product was centered on the human experience. So this is something that I think is very important when we think about what products we're building. How do we build for the users? We shouldn't just build tools because it's an engineering feat. We need to have it as a way of actually tackling issues that our users are facing, and this was a very important one.

So, a few data points about me. I grew up in Charlotte, North Carolina. I went to UNC Chapel Hill. Go Tar Heels! Any Tar Heels fans? Okay. Yeah, go Tar Heels. There I studied linguistics and psychology; I was very interested in language and how the brain understands language. I went to graduate school for cognition and perception, and then I did a postdoc, essentially in plumbing, data plumbing, and I'll talk about that a bit later in the talk.

So here are some guideposts for some of the things I'm going to cover in my talk today. First is just my journey into open source. Then we will talk about "the view source for science," the tagline for iodide, the project that we're working on at Mozilla, and Pyodide, which is a project that is an offshoot of iodide to bring Python to the browser. Next I will be fanboying one of the projects that we have at Mozilla called Common Voice, and I'll talk about why it's important for free speech and open speech. And then I will wrap up the talk with CIEL, which is the Cognitive Innovations in Education Labs, and the BrainWaves project, which is how we are giving high school students from underserved backgrounds opportunities to do first-in-class research using
high-quality technology and providing access to course materials that most people in high schools just don't even have opportunities to.

Okay, so first I'm going to take you to the beginning of my PhD journey. I joined a research lab where there is this cookbook: you just follow these steps, these recipes that are very opaque. You do it, and you just hope to God that you can get a Nature Neuroscience paper out of it. So I'm going to walk you through some of the steps of how this looked, and why the tools that I ended up contributing to really help with demystifying this and actually offering transparency into the workflows of scientists.

So I worked in MEG, which stands for magnetoencephalography. We would measure the magnetic fields that come from brain activity. Most people are familiar with EEG, where you measure the voltage on the scalp. What we know from physics, or at least what I learned from physics and kind of forgot, is that every electrical current makes a magnetic field, and this magnetic field can be measured using SQUIDs, superconducting quantum interference devices, which is just a fancy way of saying magnetometers that can measure very minute magnetic field changes, in femtotesla. And I can't even give a reference for what a femtotesla is. It's just that it's really, really, really small.

So we have this data that's coming from our studies, where we would have, say, 157 channels that are covering the scalp. It's in a machine that looks like a toilet bowl. So you put your head in there and we just read your brain activity.
It's really cool, but this hardware that we had was developed by a very niche hardware company. They created their own data format and didn't share any of the data specifications, and there was no way to really work with the data unless you were familiar with their tools or you basically hacked it. So I essentially ended up hacking it.

What we would need to do first was export all of the data into an ASCII text file. So imagine thousand-hertz data, so a thousand samples each second, that you're recording for, say, 30 minutes or an hour, where you have 157 channels, then three reference channels, and then 32 auxiliary channels. So the data are just growing in size and complexity, and you're exporting to an ASCII file. So pour some out for the metadata that we lost in this conversion. It takes about 45 minutes to get this file out, because that's just how the software worked. Then you have to go to bash, type in a command, and basically convert it to another data format, and that took around 30 to 45 minutes itself. So you're spending like an hour and a half before you've ever seen the data.

So this is how I spent the first year of my grad school. I just did it, because I wasn't a software engineer. I'm not a software engineer. I never claimed to be a software engineer. I had very little experience with programming. It would be Perl. I learned a little Perl in the research assistantship that I had, and I still don't even know what it does or what I was doing. It was just scripts that you edit, and you're like, oh, I think it works now. So that was my experience. And I took a MATLAB class that I failed. Well, I didn't fail, I got a C-minus, but I did miserably. I had to rotate a cow, and that was the project. And I just never understood it. The delivery of the course material was not in a way that was approachable for me, being a linguist and a psychologist.
It was just geared at, I don't even know the audience it was geared at. I don't even think computer scientists would actually appreciate it.

So I asked over the listserv whether anyone had any experience with the hardware that we had. A few people did, and they were able to point me to some files that they'd collected over the years. This tends to happen. You'll find, say, this one file that someone just stored somewhere, but it's never put in an open place that anyone else can find a reference to. So I'm pulling together all of these resources. One is telling me where the pointers to the data within the files are, and how to read the data plots, and what type of sensors it is.

And this was my first approach into understanding programming. The first approach that I had was this: you type some bash, say some incantations, you cross your fingers and pray to the point-oh-five gods that you're going to get something, that they just squeeze out some significance, and then you hope that the Nature gods shine down on you and may grace your CV one day. I didn't really want to buy into this system, so I tried some new approaches.

So this is MNE-Python. MNE was first written in C by one developer, who's a physicist. It works, but C is a really unapproachable language if you haven't been taught how to first dive into it. You have to worry about memory allocation. You have to worry about a lot of things that modern, interpreted languages really handle for you. And then, I just like GIFs. There'll be some GIFs throughout the talk, so I hope you enjoy them.

So, wow, I didn't realize how small this was, but this is my first commit for this project, where I was trying to take the data from its native file format and read it into MNE-Python, the software for data analysis, to remove this step of having to go to an ASCII file, write bash, and then maybe get some interpretable data out of it.
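The idea of skipping the ASCII export and reading a binary file directly, by following pointers stored in the file to where the data live, can be sketched roughly as follows. This is a minimal illustration only: the header layout, the `read_samples` and `make_example_blob` names, and the 16-bit interleaved samples are all invented for the example. It is not the vendor's format and not the reader that was contributed to MNE-Python.

```python
import struct

# Hypothetical layout, invented for illustration only (NOT the vendor's format):
#   bytes 0-3   : little-endian int32, byte offset where the samples start
#   bytes 4-7   : little-endian int32, number of channels
#   bytes 8-11  : little-endian int32, number of time samples
#   from offset : int16 samples, interleaved across channels for each time point
HEADER = struct.Struct("<iii")


def read_samples(blob):
    """Return one list of samples per channel, decoded from the bytes in `blob`."""
    data_offset, n_channels, n_samples = HEADER.unpack_from(blob, 0)
    fmt = "<%dh" % (n_channels * n_samples)
    flat = struct.unpack_from(fmt, blob, data_offset)
    # De-interleave: channel c owns every n_channels-th value starting at index c.
    return [list(flat[c::n_channels]) for c in range(n_channels)]


def make_example_blob():
    """Build a tiny in-memory 'file': 2 channels, 3 samples, 4 bytes of padding."""
    header = HEADER.pack(HEADER.size + 4, 2, 3)
    padding = b"\x00" * 4
    samples = struct.pack("<6h", 1, 10, 2, 20, 3, 30)
    return header + padding + samples


channels = read_samples(make_example_blob())
print(channels)
```

The point of the sketch is that once the offsets are known, the samples come straight out of the binary file in one step, with no intermediate text export where metadata can be lost.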
So, the first commit: I wish I could read it to you, but I can't. But it goes on to say, like, hey, this is my first attempt. It's a work in progress. Love feedback. I don't know what I'm doing. And that's essentially how it goes. And then I actually went back and looked and saw that I left really detailed comments in my commit message. One thing that I would like to advocate to everyone is: remember, your commits are a love letter from the past. So make sure you fill them with rich data, so you understand why you did the steps that you did.

So it started in January. I spent the month on it; this was over a winter break. And then I essentially got to a point where I'm like, hey, I think it works. It kind of works. It's doing what I expect. What are the tests that I need in order to make sure that's going on? I knew that tests were something that's useful in software, but I didn't know what they were. What is a unit test? How do I make sure the code is doing what I think it's doing? So this was my first approach into contributing to an open-source project. And then there is this discussion, and then many more comments, and then many more commits, and then the frustration of trying to do a rebase or trying to merge in commits that, say, your fellow colleague would like to share with you. This went on and on and on, and it got to a point where the lead developer and I were on a phone call just to walk through how to rebase my code because it had gone so awry.

Then, my first major PR, by the numbers.
I opened three separate pull requests, because the first two were just lost causes. It was around 104 commits, and most of them were like: discover PEP 8; now understand PEP 8; still don't understand PEP 8. And that just continued on until I started to understand what the style is and what it is to write Python code. I started learning a lot from the people who were part of this project. They really helped contribute to my understanding of how to create resources that would be shared and useful for others. And then after 181 comments (I don't understand why software isn't just like peer review, because I had to do a lot of work), it finally got merged. It was such a success, it was such a rush, that I continued, and I still continue to contribute to this project.

So what this led me to understand is, okay, we are using all of these disparate file formats in neuroimaging. How do I go from, say, collecting data in my experiment to sharing it with a colleague? There hadn't been any real efforts to push this forward other than ones that labs would do for themselves and try to get other labs to buy into. There wasn't really a community effort behind it. So the Brain Imaging Data Structure, BIDS, started as a project to standardize MRI data. MRI stands for magnetic resonance imaging, so it's what you get when you go to the hospital and, say, you tore your meniscus, or they want to see if you had a stroke. They can put you in the machine and get these images. So this was a way for the MRI researchers to standardize their data. And then, from the MEG community, I was like, oh, that would be very useful for us to also have. So I saw that there were calls to participate in creating this data standard, and I had a lot of familiarity, five years later, with different formats.
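To make "standardizing the data" a little more concrete, here is a minimal sketch of what a predictable, BIDS-style file layout can look like, assuming the key-value naming pattern (`sub-`, `ses-`, `task-`, `run-`) and a `meg` folder. The function name `bids_meg_path`, the labels, the zero-padded run index, and the `.fif` extension are illustrative choices, not details taken from the talk, and the real specification has more entities and rules than shown here.

```python
from pathlib import PurePosixPath


def bids_meg_path(subject, task, session=None, run=None, extension=".fif"):
    """Build a BIDS-style relative path for one MEG recording (illustrative only)."""
    folders = ["sub-%s" % subject]
    entities = ["sub-%s" % subject]
    if session is not None:
        folders.append("ses-%s" % session)
        entities.append("ses-%s" % session)
    entities.append("task-%s" % task)
    if run is not None:
        # Zero-padding the run index is a common convention, assumed here.
        entities.append("run-%02d" % run)
    filename = "_".join(entities) + "_meg" + extension
    return PurePosixPath(*folders, "meg", filename)


path = bids_meg_path("01", "faces", session="01", run=1)
print(path)
```

Because every lab names its files the same way, a colleague (or a piece of software) can find a given subject, session, and task without asking the person who collected the data.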
I had written a handful of other packages to convert their data as well. So this is really about how you take data from one person and connect it to another. It's the plumbing. It's the infrastructure. It's the thankless thing that you know is necessary, but you get your hands dirty doing it. So this is just briefly to show you a schematic that we had for how the data would be structured. And this is a way that software developers can really write the specification into their code to have the data come out in this nice, adopted way of sharing your data, which is really cool.

So the MNE team and the BIDS team have both been working very hard to build community so scientists can collaborate. And this isn't easy. Getting scientists to collaborate is like herding cats. There's also not the same incentive structure to really align with collaborating in, say, team science versus individual science. Everyone's fighting for that first-author publication, and where does that come when you have a team of 50? So this work has to be very intentional and very deliberate.

This is just a photo of a bunch of data nerds in Montreal at the neuroinformatics conference. And then this is a coding sprint that I participated in recently to help develop better standards for MNE. So a special shout-out to Chris Holdgraf and to Dan McCloy, because they've done a lot of great work in building out the documentation and creating tutorials, really getting to a point where new contributors can come and use the software and feel welcome. We have a code of conduct. We have all of these new tools that have been put in place to really help our project grow and benefit the whole community. So I really thank the entire team for all the work
they've been doing, and we know that there's much, much more work to be done.

Next up is a project that I've been contributing to. I made my first PRs just a month ago, but I've been using it while they've been working on it for the past year, and it's called iodide. So why the name iodide? One, it has high recognizability: you can say iodide and people are like, oh yeah, I can remember iodide. And it has low confusability. So you don't, say, have a package named Go and try to search for "go" in Google search. It's really hard to get up to the top when you're competing with a lot of other recognized words. So we're optimizing for search, and it's kind of cool. It's a shout-out to the science community, especially chemistry. We are trying to have what we call "the view source for science." You're able to see how reports are being created by having both the narrative and the code nicely blended together, and I'll walk you through it. I promised you more GIFs. We have more GIFs.

So, what are we looking at here? These are just some of the notebooks that we've created with iodide, which allows you to really use the web browser as a rendering engine for all your wild dreams for data science, which is really neat. We're able to build on top of all of the web technologies in order to bring an ecosystem that's very friendly for sharing among your colleagues. What we want to do is tell compelling and expressive stories, and you're able to do that using these technologies. The web has been that platform and that rendering engine for all the content that we normally consume on a daily basis now. So you have the benefit of that, and it's native to the web. You can think of it as if you read the New York Times, The Upshot. You see all of these very cool graphics and you would love to know how they made them, but you can't. So this is offering an opportunity to see: what is the code that built that really neat graph?
So these are some examples of using WebGL in the browser. The one in the middle is the Lorenz attractor, and the one on the right is all of the eviction data in San Francisco. So we can actually pull from open data portals and have it import right into your native space, and you can build visualizations from it.

So, iodide at a glance. If you hit the explore button, you're able to see both the code that created it as well as the narrative that goes along with it. So what we have here are code chunks that are delimited. You can have Markdown, where you can have all the LaTeX that you want to add, all those equations, as well as JavaScript, which gives you ways of importing very cool libraries to do charting and other infographics.

So why JavaScript? JavaScript is native to the web. But we also wanted to expand what the capabilities of this notebook were, so this is where Pyodide comes into play. I'll give a slide in just a moment about it, but the idea behind Pyodide is that we can take the Python stack, the CPython implementation, and compile it using WebAssembly, which is really cool. It's a new technology that has been adopted by major browsers for running native compiled code on the web, which is really neat.
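The "delimited code chunks" idea, one plain-text document holding narrative and code side by side, can be sketched with a small parser. This assumes delimiter lines of the form `%% md`, `%% js`, and `%% py`; that syntax, the `split_chunks` name, and the example notebook text are assumptions made for illustration and should be checked against the iodide documentation rather than taken as its actual format.

```python
def split_chunks(notebook_text, delimiter="%%"):
    """Split a plain-text notebook into (chunk_type, chunk_body) pairs."""
    chunks = []
    current = None
    for line in notebook_text.splitlines():
        if line.startswith(delimiter):
            # A delimiter line starts a new chunk; the rest of the line names its type.
            kind = line[len(delimiter):].strip() or "unknown"
            current = {"type": kind, "lines": []}
            chunks.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [(c["type"], "\n".join(c["lines"]).strip()) for c in chunks]


# Hypothetical notebook text: narrative, JavaScript, and Python in one document.
EXAMPLE = """%% md
# Lorenz attractor
Some narrative with $\\sigma$ in it.

%% js
const sigma = 10;

%% py
import numpy as np
"""

for kind, body in split_chunks(EXAMPLE):
    print(kind, "->", repr(body))
```

Keeping the whole notebook as one delimited text file is what makes the "view source" idea workable: the narrative and the code that produced a figure travel together and can be read, diffed, and shared like any other text.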
This goes into this exploration and explanation that was highlighted in Adam Rule's paper, where you see computational notebooks taking on a new life: both being a way to express research that you're doing, and also ways to explore that research and to ask other questions of that research. So we're borrowing a lot of ideas for iodide from established projects, and we love them all the same as well. We take from RStudio and we take from JupyterLab, and we're really just trying to create another way of sharing content and addressing our needs at Mozilla.

So one of the things that we had as a pain point at Mozilla is that we normally work with large and large and large amounts of data. So what we tend to do is use distributed computing tools such as Spark, or PySpark, which is the Python interface to Spark, where we can pull this data into a data cube that is much more manageable. So we do our data wrangling and aggregation, and we're like, okay, what is the analysis that we want to tell with this? What we did in the past, very early on, was export to Google Docs. You make your figures, you write your narrative, and this is what you can share with a product manager or other stakeholders. But a lot is lost in doing that. You no longer have the source for what you're creating, and you no longer have any of the context for how that figure was created. So we wanted to be able to bundle both the code that rendered those images or those graphs and have all the data come along with it. And this is what we can now bundle with HTML. You can just package it all together and ship it, and that's our idea of sharing data at our company.

So as I mentioned, Pyodide is this new tool, which is for implementing Python in the browser, and it comes with three of the major packages as well as the native Python packages. You have NumPy. You have pandas.
You have matplotlib. Those are working just fine. It's using WebAssembly to compile it. You basically fetch, and you can have this entire Python environment brought into your browser. And with this you also have the opportunity of using all of the web APIs that go along with anything that you do in the web framework. So this is really cool. If you, say, want to plot a matplotlib figure on your website, you can now import the document and then say document.getElementById, and then you can just say, here's my matplotlib figure, and it will render and have the full expressiveness of, say, a Tk output of the plot that you would see in a native Python environment. It's really, really neat.

And we take inspiration from prior and related work like PyPy.js, or zero dependency Python, and Brython, which are other implementations. What we're doing is just taking the native C code and using WebAssembly, so that there's a clear provenance in how the work is being done. No transpiling. This is actually native CPython code running in the browser, which is pretty neat.

So we also have some experimental packages that we're working on. We have experimental support for SciPy. You can also use scikit-learn in the browser, and we have support for some parts of the Fortran packages.
This is like Fortran 77. And if you want to use more modern Fortran, like Fortran 95, we really don't have that support right now. But the reason that we're interested in Fortran is not because we really want to bring Fortran to the web; it's that we want to bring R to the web. And being able to bring R to the web will allow for all of those same nice tidyverse implementations of your workflow to be imported into the space. So this is really neat, and we actually have a call for proposals out right now. So if you know anyone who's really big into compilers, or if you're really big into compilers, you can apply for this. I think I put a slide up here. The slides will be shared after the talk, so feel free to check it out. We're taking proposals up until the 31st, and we also have eight other projects that you can apply for. So maybe compilers aren't your jazz. There are other things that you can also apply for.

So with this, the iodide project is building tools that allow you to tell stories that are compelling in a very interactive way, and this is just one way that we're approaching, at Mozilla, how we can share the reports that we're doing. In the near future, I'm working on a project to bring a data science blog to Mozilla. So we will have a way to communicate with our users, share the reports, and be fully transparent about the types of things that we're using to make our decisions. We are only data stewards of the data that you allow us to use, and we want to show how we're using it and how we're informing products. And this will be one way that we can share the reports that we have done in order to make better decisions on your behalf.

So try it out. Feedback is welcome. Contribute. You can check out these links.
It's a neat project, and we would love to collaborate with other projects and really figure out how we can develop synergy across all of these other projects doing similar things, and figure out where we can converge and work together.

So another project that we're working on at Mozilla is called the Common Voice project. I'm not directly involved with this, but I am such a fanboy of it. All of these slides for Common Voice have been given to me by our open innovation team, so you can thank them for these beautiful illustrations. I really wish I had that design capability. But the idea behind Common Voice is that we're trying to allow users to contribute to an open dataset that is privacy-protecting, done in an ethical way, and that can be shared freely. We also would like to build community around this project so that we can build datasets for the communities that we aim to serve.

So why free speech? This is a market segmentation of voice assistant products that are being used in the US. And what we see is that there's an estimated 35 percent per year growth from now until 2024 in the use of voice assistants. So: hey Siri, hey Google, hey Alexa. These are the voices that we've been interacting with in the world, and it's becoming a much more intuitive way for us to interact with the technology that we use on a daily basis. But what's happening? We actually see that even though we have such a huge growth in the uptake of voice assistants, it's being consolidated among just a handful of companies, which don't really provide data to the public to build their own work, and we think that this is stifling innovation.

So we at Mozilla like to be rabble-rousers. We like to disturb the market. Our idea is: if we are able to build community and crowdsource, and do it in ethical and safe ways, how can we build tools that can compete against these really, really big giants
in ways that empower the contributors and the developers to create tools using open data?

So the main ingredient we have behind this is the project Common Voice. What Common Voice does is that we take CC0 data of written texts and we ask our users: would you read this text and donate your voice samples to a collective? So the idea is, you go to the website, you read the token sentence, and then you submit it if you would like to contribute. This is really cool. This is a way that we can amass the thousands and thousands of hours you need to actually make very human-like voice recognition software.

All of this data is publicly accessible. We are the second-largest publicly accessible dataset. I think we may have become the first, but don't quote me on that. But we have really amassed a lot of data from our contributors, in ways that we are able to build a voice recognition model for English that is at around 95 percent human-like, which means that it gives an error around five percent of the time, and this is expected in terms of the performance given our dataset size.

So the collection of the data is Common Voice, and the model that we're using is part of the project DeepSpeech. DeepSpeech is creating a voice recognition model that we will actually publicly share with the users, taking the data from Common Voice as input and giving the model as output. It's all in the Git repository.
So if you would like to check it out, see how it works, or offer some suggestions for fine-tuning the model, you're more than welcome to do so, and you're more than welcome to join the project.

So Common Voice is part of Mozilla's initiative to teach machines how real people talk. I mentioned English, but what this project is also doing is tapping all of the communities around the world: Spanish, Portuguese, French, all of the languages that you would like. What we will do is help you find datasets that are openly licensed so that we can use them, and then allow you to use the platform, where all the data is accessible to you and your community. It's creating this way for you to crowdsource among your community and really have the opportunity to empower and grow this project. So Common Voice connects the dots at Mozilla around our mission: around privacy, diversity, and creating community in the speech space. It's pretty neat. And Common Voice empowers our contributors to build open voice datasets for their community.

The last project I would like to talk about is CIEL, or the Cognitive Innovations in Education Labs. CIEL is newly formed, as of the past three months, and there are three of us who are working very, very hard to get it off the ground and really reach out to schools in the New York City area, and I'll tell you a bit more in the next slide. So our mission as a nonprofit: we have the aim to deliver high-quality, low-cost, hands-on, real research experience to underserved and underrepresented communities.

So I'll give you some inspiration for why we did this project. I taught a workshop in Thailand with a colleague of mine from undergrad, Pia, who is a Fulbright scholar, and we had this idea of blending engineering education and cognitive science education in a way that's very hands-on, fun, and engaging. So we called it DIY Cognitive Sci. And the idea is that we created two research devices. One was
an eye tracker, for which we used a PlayStation Eye. We got really cheap cameras, and we were able to tear them apart and retrofit them to our needs. And then we also 3D-printed a headset so that we could install an open-source EEG microcontroller, record brain activity, and show students their brain activity for the first time. So it was really neat. We were taking off-the-shelf parts and low-cost equipment and access to a 3D printer, and putting all these parts together to teach both how we're building the research equipment and how this research equipment can be used to answer questions that we might have. We focused on cognitive science because I'm a cognitive scientist, and I'm really interested in how people think, perceive, and act in the world, and it's a very good way of teaching students how they can think about the world in a very grounded, empirical way.

What came out of this was another project called Open Experiment, which is branded BrainWaves. BrainWaves is a project that we received an NIH grant for, so we have five years of funding to teach neuroscience to high school students in New York City.
So the first and fifth years are for development and for evaluation, and then two, three, and four are for classroom implementation. So the first year is for development, two, three, and four for implementation, and the fifth for evaluation. I kind of stumbled on that.

And the idea behind it is that we are creating an all-in-one application that allows the student to connect to their hardware, see the data that is streaming in, and run a psychology experiment, or a cognitive neuroscience experiment: once you throw the brain in, it becomes a neuroscience experiment. Then they're able to analyze the data and write a report based on the data. And this is all built around web technologies that pull from open-source projects that I've contributed to over time. So this has been my way to stitch together, say, my MNE contributions and my jsPsych contributions and the other open-source projects, and really bring in those communities in a way that we can all build toward making an app that will be used for teaching in high school, and that can be taught in colleges and in many other places where you really want to give a hands-on experience of research methods. You can actually go to our GitHub and download it and check it out if you're interested. You can also send us issues, and file pull requests if you're really daring.

What we have now is a way of visualizing the brain data, and then we can run this faces-versus-houses experiment, and we are able to time-lock the brain activity that's coming in with the screen presentation. So students are starting to understand how we tackle these issues in the lab when trying to make inferences about what we see and how the brain responds to it.

We also do professional development training where we train our teachers. Our approach is that we work with teachers because they're experts in teaching, and we are providing domain expertise in cognitive neuroscience. So we are doing a lot of lateral transfer
of knowledge. We come in as partners. And with this we are able to do a unit where we do sheep brain dissections, which is really cool, and the students actually get to dissect sheep brains. Then we run through a host of other experiments in our first unit. And in the second unit they get to use the BrainWaves app, where they're developing their own experiment and running it on their classmates. It's a really great way for them to engage in the scientific method. It puts them in the driver's seat, and it gives them opportunities to really explore what it is to be a scientist and have true grounding in what goes on. In all of the textbooks that they read, they now have some idea of how the information got there.

And our target demographic is that we're working with Title I schools in New York City. So that means that half the student population is on free or reduced lunch, and the kids tend to be disproportionately people of color. So we're really trying to tackle both the underserved community, the underrepresented community, and those who lie at the intersection of both. So CIEL and BrainWaves both help foster growth through hands-on experiments and data collection.

And this has been my way of navigating through the projects that I've contributed to, to the work that I'm doing now at Mozilla, and also the work that I'm doing in the community, and ways that we can really center people in the way that we build data tools and the ways that we approach looking at the world around us. So I'd like to thank you all for your attention, and I'm open for questions.