So yeah, thank you for having me. My name is Lai Yi Olson, and I'm the project director of Measurement Lab. I can't say I've ever talked to 850 people from my apartment before, so this is great. I'm especially excited for this conference because most of the people we talk to are internet researchers and broadband policy and advocacy groups. We learn a lot from them, but there are actually only a few open internet performance measurement projects, so it's really great to be talking with folks who understand the challenges of running an open project. My goal for today is to describe Measurement Lab and our data well enough that you can understand our challenges and engage in follow-ups afterwards. So here's our agenda for today. I'm going to ask the question: when we say MLab data, what is that? And what is MLab is a question that goes with it. I'll touch on how to access the data, then get into some of the challenges, which I'll also speak to throughout. The idea is that hopefully I can follow up with the folks here about how you all are approaching these questions of running an open project. So the first question we get is: what is MLab data, and what can I do with it? When you say open internet performance measurements, what does that mean? A good way to answer that is to start with the question of what MLab itself is, and a good place to start there is the project's mission: to measure the internet, save the data, and make it universally accessible and useful. Already we get into some nuance, because there are so many ways to measure the internet. You can imagine that the internet is a complex network of networks.
We know this, and when you add measurement into it, that only increases the complexity: understanding it, measuring it from different vantage points, et cetera. So there are many ways to do this. MLab does not claim to do all of them, but we do have principles that guide the decisions we make and that we advocate for as well. Those principles are open, user-contributed, and longitudinal data, so let's break that down. With the first one I'm a little bit preaching to the choir, right? But we think open data is especially important in internet performance research, because when decisions are being made that affect consumers' internet, we believe the research and analysis behind those decisions should come from an open source, an open data set where everyone can see the numbers that were used to calculate the other numbers, right? So open is part of our methodology: we have open data, all of our code is open source, and we'll talk more about that. (Sorry, I'm in New York, and New York is somehow still loud even in quarantine, if you heard all the background noise. Hopefully it's not drowning me out. Okay, good, thanks.) We also strongly support user-contributed data, and our own data is user contributed. The idea is that when users contribute measurements of their internet, the data represents the consumer perspective. There are many valid ways to measure the internet that are not user contributed, but we believe at least some measurements should represent the consumer's perspective when making decisions that affect them. The last part is also probably preaching to the choir here, but we believe longitudinal data is important, because the internet is going to be different from one second to the next, right?
To make any sort of decision about the internet, you need to be able to understand it over time: how it's changed from one day to another, one month to another, and so on. Measurements of the internet are only so useful in isolation; the real analysis can start happening when you look at them as a longitudinal data set. So those are the core tenets of how we think about our measurement methodology, and when you talk about MLab, everything follows those guideposts. But what is MLab is still the question, and I'll talk through six components. The first is that it's a team of people. We are a fiscally sponsored project of Code for Science and Society, which I believe is an organizer of this conference as well, and we've been there for about a year amongst other open data and science projects. Myself, our program management and community lead, and our platform engineers are all staffed there. We've also had a number of contributors over the years, most prominently Princeton's PlanetLab, New America's Open Technology Institute, and several others. Google is a current contributor: they support the project with infrastructure, internet performance research, and a small team of software engineers who help us write open source code for the platform and pipeline. So our team is made up of both Code for Science and Society and Google team members. For the next part of the what-is-MLab question, I think it's useful to go back to our origin story, to understand the problem that MLab was originally trying to solve and still is. Back in the old days, in 2008, Vint Cerf started a conversation around: what is really missing in internet performance research?
What are the resources that you, meaning internet researchers, don't have and would like to have in order to complete the experiments that would give us better information about the internet? The widely shared answer was a lack of widely deployed, professionally maintained servers with enough connectivity to support the experiments that researchers wanted to run. You can imagine: the internet is big, so it's difficult to build an infrastructure large enough to gather a valuable amount of information from different vantage points. That is what the MLab platform is trying to solve. We do this by hosting our servers in what we call off-net, tier-1 data centers. By off-net, we mean networks that primarily host content, versus networks that primarily connect to individual people. If you can imagine, they're the nodes that connect the nodes, and that's where we host our servers. The idea is that we're measuring the most relevant path for the consumer, from consumer to content. We host our servers in 130-plus locations, and we have a little over 500 of them. Again, this is to provide a resource where researchers can host experiments that get a wide breadth of information about the internet, versus just small portions here and there; it's easier to collect at a large scale than to try to join many tiny data sets. As for the platform on these servers: like I said, the point was for experimenters to host experiments, and we currently host three of them, which I'll show in a minute. All of these experiments are reviewed by our experiment review committee, and they're mostly proposed by the academic research community. Currently we host NDT, which we'll talk a bit more about later; this test is maintained by the MLab team.
We also host a test called DASH, which measures the quality of video streams, and Wehe, which measures the differential treatment of applications by ISPs. All of these tests are examples of ones that were proposed to our committee, approved, and then hosted on our platform. There are various requirements for how this happens, but the most prominent one gets back to the principles we talked about: the tests are required to be open. Again, the idea is that not only does the data need to be open, but the code for the tests that produce the data needs to be open too. We enforce this by requiring that the tests run in an open source Docker container on all of our servers. It's interesting, because what we're essentially providing is not only an open data set but an open platform as a resource. There are parts of the platform that affect the results of a test, and there are parts of the test itself and its methodology that affect the results. So one of the challenges we face is making distinctions between the two, between the platform and the tests themselves, and I'm curious how other open infrastructure projects here handle that. The way the experiments are run is that the MLab platform hosts the server side of the tests, and people develop clients to run tests against those servers. This is where the user-contributed perspective comes in: users can develop clients, and run tests with those clients against all of the servers that we run. Anyone can develop a client, and this gets back to our open principles again; it's important to us that anybody can do so.
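The open-container requirement mentioned above might look something like the following. This is purely a hypothetical sketch: the repository layout, test name, and port are illustrative and do not correspond to a real MLab experiment; the point is simply that the full server-side build is public and reproducible.

```dockerfile
# Hypothetical sketch of an experiment packaged as an open source
# Docker image, as the MLab platform requires. All names here are
# illustrative, not a real MLab test.
FROM golang:1.17 AS build
WORKDIR /src
# The complete source of the test must be publicly available.
COPY . .
RUN go build -o /example-test ./cmd/example-test

FROM gcr.io/distroless/base
COPY --from=build /example-test /example-test
# The server side listens for client-initiated measurements.
EXPOSE 4443
ENTRYPOINT ["/example-test"]
```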
One example of a client that has been developed: if you Google "how fast is my internet," you get NDT, which is what we refer to as our speed test. From there, all of the tests that are run produce data, and that data goes through our pipeline and is archived in a public Google Cloud Storage archive. The idea is that we collect it all, it's all public, and you can access it if you want to. I will say, though, that it is a lot of data, and very few people want the raw data, so we also put it into BigQuery, and this is actually how most people interact with MLab. Most people know MLab through our data set, and they know the data set through accessing it in BigQuery. This gets back to our origin story. Researchers were saying that even with this platform, there would be so much data and no real way to share it with one another. The second part of our answer is putting all the data into BigQuery, a distributed database provided by Google that allows researchers to access all the data in one place. We also provide visualizations, but I will say those are under construction, and we are actually looking for a data visualization contractor, so if you're interested, please let me know. I've probably powered through these slides a bit, but this gets into the nuance between the testing methodology of the tests we run and the platform methodology, and how they're different. When people talk about MLab data, it can be useful to define which one they're talking about: are they talking about the platform, or about the test itself?
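To make the BigQuery path concrete, here is a minimal sketch of building a query against the public NDT data. The table and column names (`measurement-lab.ndt.unified_downloads`, `a.MeanThroughputMbps`, `client.Geo.CountryCode`) reflect my understanding of the current public schema, but check MLab's documentation before relying on them.

```python
# Sketch: build a BigQuery SQL query for MLab's public NDT download data.
# Table and column names are assumptions based on current public docs.

def median_download_query(country_code: str, day: str) -> str:
    """Return SQL for the median download throughput (Mbps) and test
    count measured from one country on one day."""
    return f"""
    SELECT
      APPROX_QUANTILES(a.MeanThroughputMbps, 100)[OFFSET(50)] AS median_mbps,
      COUNT(*) AS tests
    FROM `measurement-lab.ndt.unified_downloads`
    WHERE client.Geo.CountryCode = '{country_code}'
      AND date = '{day}'
    """

sql = median_download_query("US", "2020-05-01")
print(sql)

# Actually running it requires Google credentials and the BigQuery client:
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(sql).result()
```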
Also, if you have interacted with MLab data, it was likely NDT data, by nature of NDT being the only test currently hosted in BigQuery, our longest-standing test, and the one we maintain. So when people say NDT, it's often used interchangeably with MLab data, but they're actually a little bit separate, in that NDT is a test that we host. Again, I'd love to talk to anyone else hosting open infrastructure that's used as a resource about how you maintain the balance between what the resource is and what the usage of that resource does. The last thing on our data gets back to the idea that one test is only so useful, but when you put all of the tests together in a data set, you can start to see trends in internet performance at a large scale, and this is really the value of the data. A single NDT run, like any single test, only tells you about that one measurement; put many of them together, at the scale our platform allows, and you start to see trends based on different factors. That's really where the magic of the measurement happens. We get something like 3 million NDT measurements per day, and we have close to 2 billion rows of that data in our table as of 2020. So it's massive, and it's all public, and I think I'm talking to people who understand that that's really important. This information is really good at showing you a zoomed-out view, so you can then ask better questions about where you should zoom in. We think it's important to start big and then go smaller, especially when you're talking about something as complex and constantly changing as the internet.
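The point about single tests versus aggregates can be shown with a toy sketch. The numbers below are made up for illustration; the idea is just that individual speed tests are noisy, while grouping many of them (here, by day) starts to reveal a trend.

```python
from collections import defaultdict
from statistics import median

# Made-up measurements: (date, download speed in Mbps).
# Any one test is noisy; the per-day aggregate shows the trend.
measurements = [
    ("2020-05-01", 42.0), ("2020-05-01", 55.3), ("2020-05-01", 18.9),
    ("2020-05-02", 12.1), ("2020-05-02", 14.8), ("2020-05-02", 11.0),
]

by_day = defaultdict(list)
for day, mbps in measurements:
    by_day[day].append(mbps)

# Median per day is robust to individual outlier tests.
daily_median = {day: median(vals) for day, vals in sorted(by_day.items())}
print(daily_median)  # {'2020-05-01': 42.0, '2020-05-02': 12.1}
```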
The other way we engage with our community through open source code is by creating tools that can drill down more geographically. At the default level, we only go down to the IP level, which is roughly county level. There are times when researchers want to drill down further and ask, say, which households are able to access the internet in a certain way. So we have a suite of tools designed to provide that information, but in a privacy-sensitive way. These are the tools here, and I won't go through them all, but they're all designed around the idea of: here's the public data set, and here's how you can get more information about your specific location. A lot of the work, though, aside from running the platform and the pipeline and hosting the data set, is engaging with our community. Our community is what makes our data fulfill the last part of our mission, making it universally accessible and useful. So we engage with researchers on how to best use the information we provide, and collaborate with researchers who would like to propose experiments. An interesting thing happens too, and I'm sure a lot of people here experience this: everyone who uses your data comes with a different level of technical expertise, different interests, and different publication goals for the information. So there's that fun dance of trying to meet everyone where they're at while still maintaining the standards that MLab would like to see. I'll also say that everything I've presented is MLab at this present time, and we are very open to collaboration.
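One common way to make finer-grained aggregates privacy-sensitive is to suppress small groups. The following is a toy sketch of that general idea, with an invented threshold and invented data; it is not MLab's actual methodology, just an illustration of publishing a statistic only where enough samples exist.

```python
from statistics import median

# Toy sketch of minimum-count suppression: only publish a statistic
# for a location with at least K samples, so small groups that could
# identify households are withheld. K and the data are illustrative.
K = 3

samples = {
    "county_a": [10.0, 12.0, 9.0, 13.0],  # enough samples: publish
    "county_b": [88.0],                   # too few: suppress
}

published = {
    loc: median(vals)
    for loc, vals in samples.items()
    if len(vals) >= K
}
print(published)  # {'county_a': 11.0} -- county_b is withheld
```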
And I think that's one of the great parts about running an open project: you're always looking for feedback on how to improve your data and expand it to be useful to more people, and to act as a public resource. I have slides here about how to access the data; I think the slides are uploaded, and if not I'll upload them, so they're there as a resource if you'd like to look into how to access our data. And this was really the whole point I wanted to get to: some of the challenges I'd love to speak with you all about and follow up on. One of them is that our data is free to access. I didn't even say that yet, but it is free; you just have to sign up for our discuss list, and we can get you access to BigQuery. Sometimes there's a kind of speculation of, how good can it be if it's free? If anyone else experiences that, I'd love to talk about it. There's of course the tension, always, in an open data set: you want the information to be public, but you also want to collect it in a privacy-sensitive way. Clearly there are technical nuances to that with internet performance data, and I'd love to talk to anyone who's also addressing it. The last one I wonder about for folks here is the challenge of running what we consider, and want to consider, a public resource while still maintaining governance that reflects the principles we'd like to uphold: maintaining that balance of a community-governed project while also having your own principles that you're in support of. That is MLab, thank you for having me. For follow-up, please reach out to my email, or I am on Twitter sometimes. Thanks everyone. Yeah, thank you.
I mean, those last three points were definitely the things people were bringing up in the chat and in the questions, so I think those are exactly the types of challenges and opportunities that your project brings up. So yeah, thank you very much. We don't have time for questions because we do have to move over to the keynote, but thank you again for the presentation. We'll post the questions over in Slack, so please respond to those, and anybody who has questions for Lai, please go over there and ask them in the Q&A channel. Okay. Thank you. Thank you.