Hello, CNI community. I'm very pleased to have this opportunity to talk to you about a new project we've launched here at CREOS, the Center for Research on Equitable and Open Scholarship. The core motivation for this project is the growing evidence that scholarly communication has biases and that the practices of science pose barriers to inclusion. At CREOS, we've argued that openness and inclusion in scholarship are greatly needed, that effective change requires systematic research, and that steady progress will require reliable measurement over time. In line with this thinking, the goal of this specific project is to produce standardized indicators that describe who contributes to open science and scholarly outputs.

Before we go further, some caveats for this talk. This presentation represents my own perspective, not that of MIT, our funders, or our collaborators. Further, this is a work in progress, and it's tough to make reliable predictions about scholarly communication, especially its future. So, to summarize: if there's anything in this presentation that's wrong, or that you simply don't like, it's entirely my fault. That said, the project, and anything good that comes out of it, would not be possible without the generosity of IMLS, the wisdom of my director, and the advice of our collaborators. And of course, nanos gigantum humeris insidentes: this project builds upon the scholarship of many others, only a few of whom could be directly referenced here.

One clear finding from this prior body of research is that science and scholarly communication increasingly include open content and incorporate open practices. For example, the figure to the right, from a study by Piwowar et al., illustrates how the volume of open journal publications has increased dramatically in the last 15 to 20 years. The second figure shows how open access content makes up an increasing share of the entire scholarly literature.
And the third figure, from a recent study by Fraser et al., shows how an influx of preprints played an important role during the pandemic, making up a large portion of the entire extant scholarly literature on COVID-19. Overall, this illustrates how open science practices are increasingly visible, often, but not always, in conjunction with open access.

In a related trend, metadata about scholarly communication is increasingly available under open licenses and through public interfaces. In the last five years alone, there's been a dramatic increase in the availability of open metadata that describes scholarly works, both open and closed, and counts and citations across works, which enables us to gain insight into scholarly impact. For example, the image to the left is not just the open access logo; it's actually built from open metadata, from the cover images of thousands of open access monographs. The recent availability of this wealth of information through open platforms enables the study of the scholarly communication ecosystem at a depth that has not previously been possible.

A third shift in the scholarly communication ecosystem is towards a recognition of how science and scholarly communication embed social and ethical values. Consider the image on the left, a word cloud generated from many recent landmark reports on scholarly communication, information science, computer science, engineering, machine learning, et cetera. This word cloud highlights how open access and open science norms are frequently mentioned in these reports. We also created a table which summarizes the values highlighted in these reports and which shows how transparency, trustworthiness, and equity are increasingly recognized as an integral part of scientific research and practice.
Moreover, these reports as a whole draw attention to the need for the conduct of science to be informed by human values, the need for discipline-specific codes of ethics, and the need for an increased awareness of inequities in participation in science and in the distribution of benefits from scientific research. And, unfortunately, there is ample evidence that equity and inclusion remain dramatically uneven in scholarly publications and in the scholarly ecosystem more broadly. As shown in the maps on the right, scholarly publishers, and as it turns out scholarly authors as well, are heavily concentrated within just a few countries. Other prior research emphasizes that scholarly communication often reflects gender imbalances, is often dominated by the English language, often comprises formats that are not fully accessible to those with visual impairments, and often requires fees that create a challenge for potential authors in the global South.

Measuring progress towards openness, equity, and inclusion is not so easy. Most of our measurements come through one-shot analyses and publications, such as the figure from Piwowar et al. to the left. These measurements are informative, but they're difficult to compare, because the sources they use vary from publication to publication, they often focus on different quantities of interest and use different methods of computing them, and they often aim to describe different portions of the scholarly ecosystem across different periods of time. Finally, these publications often appear years after the phenomena they examine and lag the open data by quite a bit, so our understanding of the system is typically behind what it could be.

In response to these gaps in knowledge, our project seeks to address three core questions. What is the prevalence of members of different groups in open scholarship and open science initiatives and in their outputs?
Where are these outputs used within the scholarly ecosystem, and how does their use depend on who contributed to them? And how do these patterns of production and use vary across disciplines and other facets of the scholarly ecosystem?

We plan to inform these core questions by producing standardized quantitative measures, a.k.a. indicators, that characterize participation, and to supplement these with regular reports that describe trends across these measures. Creation of these measures will be guided by four data quality principles. The most central goal is comparability: we want to be able to compare our measures over time, with different measures of the same concept, and with different efforts in this space, such as those being conducted by the Center for Open Science and the National Center for Science and Engineering Statistics. To achieve comparability, we aim to design indicators that are regular, accurate, and reproducible. Measuring indicators repeatedly and over standardized periods of time is vital to enable meaningful comparison over time and to accurately detect trends and cycles; this is temporal regularity. Achieving temporal regularity also requires a degree of scalability, since the processes of data collection, cleaning, computation, and reporting need to be run over and over again. We also require accuracy. As social and information scientists, we know that descriptions of human systems all come with some implicit or explicit degree of uncertainty, so we aim to be tolerant of measurement error and to be honest about revealing the uncertainty in our estimates and analyses. And finally, we aim for reproducibility. Reproducible results can be traced to their source, can be recomputed, and provide key information, such as provenance, fixity, and revision history, that allows them to be managed in a principled way. Reproducibility also applies to reports, so project reports will embed data analysis.
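To make the idea of an uncertainty-aware indicator concrete, here is a minimal sketch, not the project's actual code, of how a simple prevalence indicator might be reported together with its statistical uncertainty, using a 95% Wilson score interval for a proportion. The counts in the example are hypothetical.

```python
import math


def proportion_with_wilson_ci(successes: int, n: int, z: float = 1.96):
    """Point estimate and Wilson score interval for a proportion.

    Returns (estimate, lower, upper); z=1.96 gives a ~95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, center - half, center + half


# Hypothetical: 412 of 1,000 sampled journals meet some openness criterion.
est, lo, hi = proportion_with_wilson_ci(412, 1000)
```

Reporting the interval alongside the point estimate, rather than the estimate alone, is one way to be honest about measurement error in the sense described above.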
These reports can be reexamined, rerun, and even extended.

The Institute of Museum and Library Services has generously funded a three-year runway for the project. During this period, we will collect the core data that produce the indicators, produce reports, generate community-requested indicators, and produce a number of scholarly publications that reflect upon the whole process. However, although only the first three years are funded, we believe it will be valuable to continue these indicators for a longer period. Because the project's design emphasizes reproducibility and repeatability, we expect the marginal cost of extending the dataset will be relatively low, and we expect the value of the data as a whole will increase substantially as more and more is collected over time.

In the first phase, we've identified eight existing data sources that will be cleaned, integrated, and analyzed in order to create the initial set of indicators. These data sources include researcher information from ORCID; information on different forms of scholarly output from the Directory of Open Access Journals, the Directory of Open Access Books, and the Open Science Framework; policy information collected by ROARMAP; article usage and contributorship information from the PLOS system; citation information from I4OC; and editorial board information from the Open Editors project, which we plan to extend over time. Consistent with the principles of reproducibility and comparability, we focus this initial tranche on data that are available under an open license, that cover a broad range of disciplines, and that are being regularly updated. This set of sources, while by no means comprehensive, will constitute a large sample of open access and open science products, span a broad set of authors, editors, and presses, and contain a variety of signals of contributorship, impact, and policy.

We've divided the project as a whole into three work streams.
The first stream of work focuses on these open data sources and on developing an open data pipeline to automate the retrieval, cleaning, linking, normalizing, and summarization of this data to produce the indicators. This pipeline itself will be open source. It will be complemented by a second stream of work aimed at tracking issue salience, for example, attitudes towards open science over time. By mining web and social media sources, we'll follow specific institutional stakeholders, using a panel design to measure the same actors at different periods, and we'll also use this mined data to fill gaps in the more structured open data. The third work stream is a pilot project involving a community consultation process that will guide extensions to the main indicator production in order to address specific community priorities.

To fix ideas and to cap off this presentation, I'd like to show you a few early results from the project. The figure on the left is derived from DOAB data and characterizes the rapid growth of open access monograph publication in recent years, although one should bear in mind that open monographs are still a relatively small share of the entire volume of monographic output. The figure on the right integrates additional information from the OpenAPC project and summarizes the fee structure for open monographs. Compared to the most recent prior data, which was collected using one-off surveys of subvention fees and production costs about a decade ago, the current fees are much lower than expected, which is a pleasant surprise.

Another example, maybe not so pleasant, describes journal board composition. This figure primarily draws on data from DOAJ and the Open Editors project. We used this information to apply a standard set of methods for geographic entity extraction and gender imputation, and to analyze the composition of thousands of editorial boards across a range of journals in different disciplines.
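The board-composition summarization just described can be sketched in a few lines. This is a hypothetical illustration, not the project's pipeline: it assumes gender and country have already been imputed upstream, and the field names and the four-person board are invented for the example. It computes two simple per-board measures, the share of women and the share held by the single most common country, as a crude concentration signal.

```python
from collections import Counter


def board_composition(members):
    """Summarize one editorial board.

    `members` is a list of dicts with already-imputed 'gender' and
    'country' fields; returns the share of women and the share held
    by the single most common country.
    """
    n = len(members)
    if n == 0:
        return {"share_women": None, "top_country_share": None}
    genders = Counter(m["gender"] for m in members)
    countries = Counter(m["country"] for m in members)
    return {
        "share_women": genders.get("F", 0) / n,
        "top_country_share": countries.most_common(1)[0][1] / n,
    }


# Hypothetical four-person board: one woman, three members from one country.
board = [
    {"gender": "F", "country": "US"},
    {"gender": "M", "country": "US"},
    {"gender": "M", "country": "GB"},
    {"gender": "M", "country": "US"},
]
summary = board_composition(board)
```

Running such a summary across thousands of boards, and grouping the results by discipline, is what produces the kind of cross-discipline comparisons discussed next.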
As a whole, one can see that there's substantial variance across disciplines. And if one drills in, one can see that journal editorial boards are mostly male and highly concentrated internationally, although engineering is an extreme point, in that it is one of the most male-dominated fields, say, compared to communication, and one of the most internationally diverse. So there is substantial variance across disciplines.

To extend this analysis, we added data from the Center for Open Science that characterizes and rates journal policies towards open science. This enables us, for the first time, to examine some relationships among editorial board diversity, open access, and open science. The figure above summarizes some of these patterns, showing the mean diversity of editorial boards grouped by different journal policies, with the lines extending from those means representing the statistical uncertainty of the measures. Zooming in, we can see, surprisingly, that open access journals as a whole have a somewhat worse gender balance than closed journals, and this pattern repeats for open science-oriented journals. What this suggests is that we have to address all of our goals systematically: we need to design and measure for each of the goals we want to achieve.

The early results I showed you are explained in detail in the preprints cited here, and the first preprint also provides a more in-depth description of the statistical and measurement design we're using for the project as a whole. All of these preprints are currently available on well-known preprint servers, they're open access, and they will be archived through MIT's institutional repository as well. Further, the code and data are distributed through GitHub, OSF, and Dataverse, and will also be archived in Zenodo. You can find links to all of these preprints on the CREOS website. Please visit the site to find out more about the project.
Not only will you find these writings, but you'll also find information about the team, events, and, coming up, calls for participation.