Okay, folks, sorry for the delay, let's get started. It's an honor to present this session, Powerful Data Science with Project Jupyter and Drupal, at DrupalCon. The talk is going to focus on the specific technologies mentioned in the title, but it's also going to look more broadly at some significant transformations underway in the very concept of what counts as complete, high-quality content in an increasingly data-centric world.

My name is Mike Nescott. I'm Director of Cloud and DevOps Solutions at JBS. At JBS, we do a lot of work with Drupal and web development, and we also do a lot of work in research and statistical analysis, so we have a special interest in the intersection of these disciplines, and specifically in the growing field of data science. We are headquartered in the Washington, D.C. area and do a lot of work with the U.S. federal government, but we also have offices in the San Francisco Bay Area and Atlanta, and we have people working for JBS throughout the world. I'm currently based in Seattle.

The takeaways of this presentation: first, there are continuous improvements in our ability to collect, process, analyze, and communicate data. Second, we can extend our existing content management systems to benefit from this enriched data, specifically by using tools available through the Jupyter Notebook ecosystem and languages that are strong in data science, including Python and R. And finally, we'll see that these tools are relatively easy to learn and to get up and running for data exploration and content development.

The term big data has almost become a cliche, but several studies have shown that we are in fact producing data at an impressively exponential rate. Behind this explosion is a vast array of expanding data collection and processing networks that span from the outer edges of the universe down to our own bodies. This data, along with our improved ability to process it and a growing number of people trained to work with it, has led to the rise of data science.

The term data science actually dates back a few decades, but a well-known modern framing comes from Drew Conway, who developed a Venn diagram depicting data science as the intersection of math and statistics, programming, and substantive expertise in a particular area related to the data under consideration.

Within data science there's been a lot of progress recently in machine learning, which is closely related to artificial intelligence. Within machine learning, a lot of attention has focused on a family of algorithms that perform what is known as deep learning, using vast networks of layers of virtual neurons to extract value and make sense of large streams of data. In the last few years we've seen impressive advances in speech recognition and image recognition. Deep learning is a core part of cutting-edge products like the Amazon Echo, and it's being used prominently in services like Google Search.

So with the rise of data science, the availability of data, and our ability to make sense of it, we are living in an increasingly data-driven world. Data science now has a big role in many fields that were previously considered relatively low-tech, such as farming and the humanities.
Data science and machine learning are an integral part of more and more applications, and data science is playing a more prominent role in content development.

This is the DIKW pyramid, a model that's been around a while that looks at the process of taking data, extracting information from it, and building knowledge and content from that. Historically, the view has been that you may have a lot of data, but much of it is wasted as noise, and you end up with a relatively small amount of useful content. With the rise of data science and the advances in machine learning, we can imagine the shape behind that model transforming quickly. The new situation, the new model perhaps, is that more of our data is content and more of our content is data.

Related to data science, machine learning, and content development, a lot of attention has been paid recently to applications such as chatbots and the use of natural language processing to construct conversational interfaces. To this point, however, there's perhaps been more widespread practical application of natural language generation to create automated news stories and automated reports in areas such as finance and fantasy sports.

The data-driven world, however, is not necessarily a utopia. Algorithms are now being used in a great many fields, and several studies have shown that many of these algorithms potentially carry biases that can do real damage to people: people in the criminal justice system, people buying homes, people applying for jobs or being evaluated in their current positions.

Another problem that has become evident is that data science is increasingly used in medicine and other sciences, but these fields are still heavily reliant on the existing body of knowledge, often contained in research journals that date back years. Over the past few years, a number of individuals have taken major studies published in these journals, looked at the documentation of the processes that were used, and tried to replicate the results, and they couldn't do it; they came out with different findings. This has led to what some have called a reproducibility crisis in science.

Solutions to some of these problems may be available, at least in part, from the collection of open movements that have emerged over the past few years, specifically open data and open source software. If the code and the data used to create an algorithm are freely available, it becomes much easier to test it, to detect whatever biases may be in it, and to eliminate those biases. It also becomes easier to collaborate on and build upon the algorithms that are high quality.

It is the quest for a development platform that is capable of working with open data, and that can be used in pursuit of reproducible research to produce high-quality computational content, that has led to the recent rise of the Jupyter notebook, previously called the IPython notebook. It initially supported only Python, but it has since expanded to support more than 40 languages. The basic anatomy of a notebook is a web application that provides an interactive interface for weaving together text, mathematical formulas, executable code, interactive widgets, and rich media. The underlying file format of the notebook is actually JSON, which gives the right mix of simplicity and functionality, allowing this platform for rich media to be easily shared in different formats, to be version-controlled using Git, and for GitHub to be used as a content distribution and presentation platform for notebooks.
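To make that concrete, here is a minimal sketch of how a notebook file can be created and inspected programmatically, using the nbformat package that Jupyter itself uses to read and write .ipynb files; the cell contents and filename are just placeholders for illustration.

```python
import json
import nbformat

# Build a minimal notebook: one markdown cell and one code cell.
nb = nbformat.v4.new_notebook()
nb.cells = [
    nbformat.v4.new_markdown_cell("# My analysis"),
    nbformat.v4.new_code_cell("print('hello, Jupyter')"),
]

# Write it to disk; the resulting .ipynb file is plain JSON,
# which is why notebooks diff and version-control reasonably
# well with Git and can be rendered directly on GitHub.
nbformat.write(nb, "example.ipynb")

# Peek at the raw JSON structure of the file we just wrote.
with open("example.ipynb") as f:
    print(json.dumps(json.load(f), indent=2))
```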
I'm going to briefly demonstrate a few examples of Jupyter notebooks that I stitched together by cloning other notebooks from GitHub and then extending and mashing them together. I'm running these notebooks in Docker containers. Docker offers a great way to get a Jupyter notebook up and running so you can begin exploring what the platform has to offer, and there's a wide collection of data-science-oriented Docker images available on GitHub that include not only the basic Jupyter notebook application but also a lot of data science tools in different languages pre-installed.

First, this is a notebook that uses open data available from the World Bank. Specifically, they have a data set known as the World Development Indicators, which tracks the socioeconomic progress of different countries over the years in a number of areas: health, education, and welfare, for example. What this notebook does is show the process of taking data from the World Bank data set, importing it into a structure known as a DataFrame, using Python tools to process and visualize the data, and then outputting the graphics in a dynamic format.

What I'm going to do, from the toolbar of the Jupyter notebook, is run all the cells in this notebook. These cells, which as you can see combine text and code, are going to execute sequentially from the top of the notebook to the bottom. At this point, the data is being pulled into the DataFrame structure in Python, and we can begin to explore the different statistics here. Then, on the fly, we have these graphs generated. If we want to, we can take this notebook, modify it, and share it on GitHub. We also have the capability of downloading it in PDF or HTML format for importing into another content management or display system.

Next, I'm going to look at a notebook in an area of specific interest to our team at JBS. We do a lot of work for a U.S. government agency called the National Institute on Aging, which is part of the National Institutes of Health, and as part of that work we developed an Alzheimer's disease clinical trials database. I found a notebook on GitHub that takes data from a study on the early diagnosis of Alzheimer's disease and applies different machine learning algorithms to it. There is an article related to this subject in the clinical trials database we built for the National Institute on Aging. So what I did was take the notebook and pull in some content from the National Institute on Aging site, which we built on Drupal 8, having recently migrated it from Drupal 7. In the Drupal 8 application, we used the new services-friendly Views support for outputting JSON to create a simple web service, and here is the Python code that consumes the data from Drupal and displays it in this notebook.
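The consuming side is only a few lines of Python. Here is a minimal sketch of the pattern, assuming a Drupal 8 Views REST export that returns a JSON list of records; the endpoint URL and field layout are placeholders standing in for the actual service we built, not its real API.

```python
import requests
import pandas as pd

# Hypothetical Drupal 8 Views REST export endpoint (placeholder URL,
# not the actual National Institute on Aging service).
ENDPOINT = "https://example.org/api/clinical-trials?_format=json"

# Fetch the JSON that the Drupal view outputs; fail loudly on HTTP errors.
response = requests.get(ENDPOINT, timeout=30)
response.raise_for_status()
records = response.json()

# Load the list of records into a pandas DataFrame so the Drupal
# content can be explored and displayed alongside the study data.
df = pd.DataFrame(records)
print(df.head())
```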
In addition to displaying the content from Drupal, this notebook pulls in a data set containing data from Alzheimer's disease patients and normal controls, and it runs different machine learning algorithms on that data set. Okay, there we go, sorry, I had some connectivity problems for a minute there while switching networks. Here are the statistics, then, and a chart that compares the performance of those different machine learning models.

There's also a link here to another notebook, developed by a researcher at the National Institute on Aging, a gentleman named Murat Vigil. He is part of a project that is bringing together a consortium of research centers to find better ways of detecting what is known as preclinical Alzheimer's disease, that is, finding evidence of the disease in its very earliest stages. What this notebook displays is the process that's going to be used by that consortium in extracting neuroimaging data from MRIs and CT scans, along with the different methods used in constructing this data. When you're looking at scientific research, it's often essential to be able to see the process from the very onset, when the data is collected, because a lot of decisions made early on, in terms of how to deal with missing or unclean data, have potential implications down the line for the actual results. So if we want to arrive at a world where more research is reproducible, it's essential, where possible, to distribute this type of computational content along with the journal articles of the future.

Let's take one final look at a notebook application, also from the world of health. This one examines a model of disease spread and how immunization can help halt it. Specifically, the model considers things like: if a disease is spreading in a population, what percentage of that population is vaccinated? How deadly is the disease? How quickly do people recover? The focus is on disease spread within networks of social interaction. A lot more scientific research is now looking at social networks to see how the interactions of individuals affect their health and well-being. This notebook uses some Python tools, including a network analysis library called NetworkX. As we run all the code in the notebook, we end up with an interactive widget that lets us explore how the disease affects the population over time, based on the parameters of the model and the interactions of the network.

You can see here, without going into too much detail, that there are 50 people in this network under consideration, each with 10 connections. If we're in the process of exploring and evaluating this model, we can, say, increase the size of the network to 500 and give each of those people 20 connections. We can go back to that point in the code and rerun those cells, and the widget is reproduced with the new parameters attached to the model. So you see we have a lot more nodes in the network, and we can look at how that process plays out in a wider social network.
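Without reproducing the notebook's actual model, the basic pattern of wiring a NetworkX simulation to an interactive widget looks roughly like this. This is a simplified SIR-style sketch of my own for illustration; the function and parameter names are assumptions, not the notebook's real code.

```python
import random
import networkx as nx
from ipywidgets import interact, IntSlider, FloatSlider

def simulate(n=50, k=10, p_infect=0.2, p_recover=0.1, steps=30):
    """Toy SIR-style disease spread on a small-world network."""
    k = min(k, n - 1)                      # keep the graph constructible
    g = nx.watts_strogatz_graph(n, k, 0.1)
    status = {node: "S" for node in g}     # everyone starts susceptible
    status[random.choice(list(g))] = "I"   # seed one infection
    for _ in range(steps):
        updates = {}
        for node, state in status.items():
            if state == "I":
                # Infected nodes may infect each susceptible neighbor...
                for neighbor in g.neighbors(node):
                    if status[neighbor] == "S" and random.random() < p_infect:
                        updates[neighbor] = "I"
                # ...and may recover.
                if random.random() < p_recover:
                    updates[node] = "R"
        status.update(updates)
    counts = {s: list(status.values()).count(s) for s in "SIR"}
    print(f"after {steps} steps: {counts}")

# Changing a slider (say, network size from 50 to 500, or connections
# from 10 to 20) and rerunning the cell regenerates the widget with
# the new parameters, which is the workflow shown in the demo.
interact(simulate,
         n=IntSlider(value=50, min=10, max=500),
         k=IntSlider(value=10, min=2, max=20),
         p_infect=FloatSlider(value=0.2, min=0.0, max=1.0, step=0.05),
         p_recover=FloatSlider(value=0.1, min=0.0, max=1.0, step=0.05),
         steps=IntSlider(value=30, min=1, max=100))
```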
Okay, as far as how we can integrate tools like Jupyter, and languages like Python and R, with Drupal: we can simply link content pieces to each other. We can, as we saw, bring in content and data from Drupal as a service. We can, as we saw, convert the notebook into PDF or HTML and attach it to or import it into Drupal. And a lot of folks are using an iframe to display notebooks within Drupal. This is an example of a notebook that is part of a blog maintained by a researcher on OpenScholar, the Drupal distribution used in many universities.

The notebook is part of a broad ecosystem, which as we saw includes GitHub; notebooks posted to GitHub are rendered automatically, with a few limitations. We also saw that Docker plays a big role in the Jupyter ecosystem: a lot of researchers are now distributing notebooks along with their research studies, either in the notebook format or embedded in Docker containers. Jupyter now plays a big role as a data science IDE platform in all the major cloud services: Amazon, IBM, and Microsoft Azure. There's a data science social network called Kaggle, where the notebook is a central part of the platform. Notebooks are being used in executable books and in journalism; this is a notebook that was distributed by a media firm in the United States along with a news story analyzing the Twitter behavior of a U.S. public official. I won't name that person, but if they were presenting here, the talk would likely be titled Making Drupal Great Again. The notebook is also being widely used in academia and in learning.

The examples we looked at used the Python programming language as the programmatic interface in the notebook. Python is quickly emerging as one of the leading languages in data science because it's relatively easy to learn, and a lot of scientists who aren't programmers by profession have adopted it. A lot of data science packages have been developed in Python, it has excellent package management tools, it's relatively easy to read, and vibrant scientific sub-communities have developed around it. The neuroimaging notebook we looked at included some Python packages focused on very specific aspects of neuroscience. Another language that's popular in data science, can be used in a notebook, and is relatively easy to adopt is R, which was developed specifically for statistics and data analysis; it was born out of the world of statistics, and a lot of scientific and statistical packages have developed around it. Within a notebook, you can either run the entire notebook in a specific language, or you can actually mix different languages within a single notebook using a concept known as cell magics.
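As a small illustration of cell magics, here is roughly what mixing Python and R in one notebook looks like using the rpy2 extension; this assumes rpy2 and R are installed, and the data and variable names are placeholders of my own.

```python
# Cell 1 (Python): load the R cell magic provided by the rpy2 package.
%load_ext rpy2.ipython

# Cell 2 (Python): build a small pandas DataFrame to hand off to R.
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.1, 3.9, 6.2, 7.8]})
```

```python
%%R -i df -o coefs
# Cell 3 (R code under the %%R magic): -i imports df from Python,
# -o exports coefs back to the Python session.
fit <- lm(y ~ x, data = df)
coefs <- coef(fit)
```

Back in a Python cell, coefs is then available as an ordinary array, so the R model's output can feed directly into the rest of the Python workflow.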
The person who's given credit for creating the Jupyter notebook, a gentleman named Fernando Perez, sees it as a tool for accelerating the pace of scientific research. Historically it has taken a long time for someone with a new idea to actually get it published in a journal article; there's a long, laborious process for taking a scientific idea, writing it up, submitting it, and getting it reviewed. The notebook, as part of an open, agile content publishing workflow, could potentially help accelerate that process. You can envision a situation where a notebook is developed by a researcher in a lab or by an investigative reporter, pushed into a data or code repository where it becomes immediately available for evaluation and collaboration, and eventually published alongside an article in an open access journal.

This new model also leads us to consider, with the rise of data science and the popularity of the Jupyter notebook, whether there may be a new model emerging for scientific content, and for content in general, in a data-centric world. The idea here is that high-quality, complete content in the future may include not only the text and the graphics and the references to the data and methods that were used to produce that text, but also the whole computational environment used to create it, which becomes part of the core content itself. This is a concept that's only beginning to emerge. It's going to be very exciting to see where it develops in the coming years, and very interesting to see how we in the Drupal community can use it productively alongside Drupal itself.

With that, I'd like to thank you for hanging around so late and attending, and to invite you to complete the evaluation for this session. Thank you very much.