So, again, welcome to edition number two of the machine learning workshop. We're going to have a total of four modules that we'll be presenting. Some of it's going to be pretty heavy for you, and I think it'll be important, obviously, to have some breaks. Certainly feel free to ask questions through Slack. We have a lot of material that we're going to go through, so we'll try to save the verbal questions for the end of the lectures. Normally, if we were doing this live, I'd be quite content for people to interrupt, but we've found that because of the conditions with Zoom it's kind of hard to do, so it's not as interactive as I'd really like. Francis has already talked to you a little bit about the Creative Commons license. These are the same slides I'll show at the beginning of each of the different modules we'll present today. So this is really just the first part, an introduction to machine learning. It's about getting you acquainted with what we'll be doing, with my lecture style, and with the kinds of slides we'll be using. The image on the left: at first I thought it was some kind of early fish, but I guess it's a brain thinking. The structure is four modules today and four modules tomorrow, when we'll get into the guts of machine learning. So this is the introduction today. Some of you might find it a little tough going, but hopefully by tomorrow things will be a little easier. Today's introduction will be kind of light, so we'll probably finish a little earlier, and we may shift our schedule a bit to give us some time towards the end of the day where we're kind of short. We'll try to stick to our times as best we can. As I said before about asking questions, try to use Slack and the TAs will be able to answer things, unless, obviously, something strange is happening, like I'm frozen or the slides are frozen.
Hopefully you can communicate that by just shouting. We will have breakout rooms, as Francis mentioned, and these will allow people with similar questions or similar challenges to work with the TAs one on one, or one on three or one on four. The course has actually been developed over two years, and there's a lot of code development we had to do for it. We have both Python code and, added this year, R code. While there are certainly some excellent books, and tools like Codecademy for teaching you certain elements of machine learning or programming, the material we're going to present today is very much aligned with what we do in biology and bioinformatics. It's about looking under the hood, and hopefully giving you some insights so you can understand a little more if you want to venture off and do some programming and development of your own. The people who have been developing a lot of the code for the last eight months are Leif and Louisa. I know everyone would prefer to pronounce Leif's name as "Leaf", but it's actually "Life". They'll have lots of answers for you; they know the code inside out, much better than I do, so they're great experts to have. Now, I know people were eager to sign up for this course. We actually had more than 60 applicants, so the 30 of you who joined us today are a select group. Machine learning is certainly hot, but I think it's also important to temper your expectations. We're not going to be experts at machine learning at the end of the day, and we're not going to cover everything in machine learning. It's typically a topic that takes very skilled computing scientists multiple years to really get familiar with, so we're trying to do something in two days that typically takes people four to six years of their lives to learn.
For today, the first module looks at differentiating machine learning from conventional computing and from artificial intelligence. I'm going to show you some everyday examples of machine learning; I think you'll be surprised to see how ubiquitous it is. We're going to look at a few examples of machine learning in bioinformatics and genomics. Then we're going to talk about the standard machine learning workflow, and if that's the only thing you get from this course, that's what I'd hope you concentrate on; it's fairly simple. And then, hopefully, people have done the advance pre-reading and looked through some of the recommended material. We won't really go through the Colab and class website and code repository; we'll assume you've done that. We may have a question at the very end of my first module just to find out how many of you were actually able to get through some of the pre-reading material. So, as I said, machine learning is widely used. It's a difficult subject, and even for people with backgrounds in math and computing science it often takes four to six years to become real experts. So we're just trying to give you a taste of machine learning, and hopefully inspire you to learn more on your own by giving you all of the code. It's been written in both Python and R, so those of you who are somewhat familiar with coding would be, and should be, able to reuse it, and we're hoping that can be part of the inspiration for this course. We've chosen Python in part because historically a lot of the free machine learning software has been written in Python. We also have R code, because we know a large number of people in bioinformatics use R and are comfortable in R. So this course is bilingual, if you want, but we will be mostly working in Python.
And then if you want to go to R, the code is there and you can compare one to one, and this way some people might be able to learn Python by knowing R, and vice versa. By using the Colab environment, you don't have to install Python or a Python environment, and the same goes for R: you're using it through the Colab environment. Now, some definitions: learning versus machine learning. Learning is something that we all do; it's what we're going to be doing today. We are organisms, and what we're hopefully going to do over this two-day course is improve our performance, partly in coding and perhaps in understanding machine learning, by going through experiences: both listening to the lectures and doing some labs. Now, machine learning is different. It's not done by living systems; it's done by computers. It's a branch, a sub-discipline, of artificial intelligence, or AI. Essentially, a computer automatically improves its performance from experience. In essence, the computer develops programs; it writes its own programs that can be used to make predictions or decisions. And the neat thing about machine learning is that you're not coding it explicitly to make those predictions or decisions. You're building a framework that allows it to generalize and to make predictions and decisions, and that, I think, is what's really special. Now, machine learning is almost as old as computing. Arthur Samuel was the person who actually defined it in the 1950s, only about 10 years after the first computers started appearing. He defined it as a field of study that gives computers the ability to learn without being explicitly programmed, and that definition is still used today. To distinguish machine learning from traditional computer programming, which is what many of you have done, or some of you have done: a computer is just a glorified calculator. It takes some input, which could be sets of numbers or sets of text.
It feeds those inputs to a program, where there's addition or subtraction or division or comparison, and then it produces some output. So the key for traditional computing, traditional programming, is that there is a defined program. In machine learning, what we do is provide not only input but also output. We provide the question and we provide the answer. And by providing both of those in multiple instances, the computer essentially acts as a learner. If you want, think of it as your brain. It takes the examples, learns from experience, sees the input, sees the output, and looks for patterns. Then, instead of kicking out an output, the learner kicks out a program. That program can be reused over and over again with different types of input, and presumably it has now learned whatever skills it needs. That's the fundamental difference between programming and machine learning. To give a more explicit example: as I said, typically a computer is a glorified calculator. It can take one plus one and run it through the adder program that's found in every calculator and every computer, and it'll tell you the answer, which is two. In machine learning, what we would do is give it lots of examples. On the bottom you can see a sort of addition table: one plus zero equals one, one plus two equals three, two plus six equals eight. All examples. That's both the input, the additions, and the output, the answers. And the learner, the model, learns a concept of addition and produces essentially a model for addition. So now we have a computer that can perform addition, but it wasn't explicitly programmed to do it; it simply learned through the experience, or observation, of those examples on the left. Another example, where machine learning really shines, is character recognition or image recognition. On the left, we have four different ways that people write the number two.
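The "learning addition from examples" idea can be sketched in a few lines. This is an illustrative sketch only, using scikit-learn's LinearRegression as the learner; the lecture doesn't prescribe a specific model for this example.

```python
from sklearn.linear_model import LinearRegression

# The "addition table" from the slide: inputs are pairs of numbers,
# outputs are their sums. We never tell the learner the rule "a + b".
X = [[1, 0], [1, 2], [2, 6], [3, 4], [5, 5], [0, 7]]
y = [1, 3, 8, 7, 10, 7]

# The learner sees input AND output, and produces a reusable model.
model = LinearRegression().fit(X, y)

# The model generalizes to a pair it has never seen.
print(round(model.predict([[4, 9]])[0]))  # prints 13
```

Because the examples lie exactly on the plane y = a + b, the fitted model recovers the concept of addition perfectly, which is the point of the slide: the program was learned, not written.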
Most of them, I think, are by physicians, because doctors have terrible handwriting, but this is an example of where, if you try to put something like the number two into a computer program, it would have no idea what to do. Traditional programs are terrible at recognizing images or patterns. But by giving it, in this case, the input, which is one person's, or multiple people's, writing of the number two, and then telling it that this is the number two, and letting it learn from those examples, you develop an image recognition learner that recognizes the number two in just about any style or any way that someone could write it. That's a powerful approach that machine learning offers, and one of its real strengths. Comparing machine learning to conventional computing: what conventional computers do is tedious tasks, done faster and more accurately, and that's why we like to use computers so much. They're great for calculation, adding, subtracting. There's spell checking, where essentially the computer looks up a word and compares it to a table of known, correctly spelled words. That's a simple calculation or comparison that conventional computing can do. Machine learning will typically do things that are much more difficult than tedious tasks, things that are not possible with conventional computing. While spell checking can be done by almost any computer program, grammar checking is a much more difficult process, and so the grammar checkers you might have on your computer represent an element of machine learning. Recognizing images and interpreting speech through Alexa or Siri or other language recognition tools: those are examples where machine learning shines. Now, there is a difference between machine learning and artificial intelligence. AI is an older field, and machine learning is considered a subfield. Both still require lots of data. Some people think AI is fundamentally different from ML; I don't think so, really.
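The handwritten-digit example described above can also be sketched concretely. This is a stand-in, not the slide's data: it uses scikit-learn's built-in digits dataset (8x8 grayscale images) and a logistic regression classifier as one reasonable choice of learner.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labeled examples: each 8x8 image comes with the digit it represents.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# The learner sees many (image -> digit) pairs and generalizes to
# handwriting styles it has never seen.
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

A traditional rule-based program would struggle to define "two-ness" explicitly; here the classifier simply learns it from labeled examples, which is the strength the lecture is pointing at.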
But anyway, that's a bit of a debate. AI emerged in the 60s with the development of expert systems, where people wrote down large numbers of rules, lots and lots of lookup tables, and created vast quantities of data. This was used to help solve chess and checkers, definable games where, if you have enough data, you can beat just about any human. Machine learning doesn't tend to use sets of defined rules or lookup tables. It uses statistics, probability, and optimization. We'll see examples of these, but those probabilistic methods of calculating or creating models allow you to make both predictions and decisions. It's used in something like face recognition, as opposed to playing chess or checkers. Many of the people who are now in the field of machine learning are actually, by training, experts in AI, and so many people in AI are comfortable doing machine learning almost all the time. Here are some images to distinguish between machine learning, ML, and AI. ML is behind facial recognition, which you can see on some more advanced cell phones or the cameras in cell phones, or if you've set up a computer to recognize your face when you log in. For AI, this is an example of Deep Blue, the program that beat the world champion, Garry Kasparov, at chess. That was more than 20, almost 25, years ago now. This is an example that used expert rules and opening moves. If anyone saw The Queen's Gambit: essentially what Deep Blue did was collect all of the information about standard chess games and chess moves that experts have developed over the last century, and it played those out, and it consistently beat humans. An example where AI and machine learning were sort of combined together was in the Jeopardy challenge, which some of you might have seen.
Some of this may be brand new to you, but essentially the top two human performers in Jeopardy, Ken Jennings and Brad Rutter, played off against a computer called Watson. Watson is in the middle, in case you didn't recognize it as a computer. And you can see there that it's clobbering Ken Jennings on the left and Brad Rutter on the right. What Watson did was essentially become a question answering system. IBM threw almost a billion dollars into developing it, and I think in the end it won a million dollars, so the payoff wasn't great. A thousand people were involved in it at different times. It used natural language processing, which is part of machine learning. It used a lot of information retrieval and automatic reasoning, which is a form of AI. It used vast collections of data: Wikipedia, encyclopedias, dictionaries, thesauri, newswire articles. A huge resource of data to help figure out some of the questions that were being asked. In terms of the Jeopardy challenge, Watson did really well, but in the last 10 years they've continued to develop Watson. It's now moved from the hard-wired computer that you saw in the picture earlier to the cloud, and it's now able to see, hear, read, talk, taste, interpret, and recommend; it has many functions that are quite exceptional, and it's being used in a variety of applications. From the Jeopardy challenge winner to something very close to what science fiction writers would imagine as a real intelligence: that's what's now embedded in Watson on the cloud. Now, there's also a difference between machine learning and data mining. Both fields use a lot of data, and both can be used to predict in some way. But machine learning tries to predict from known knowledge, while data mining focuses on the discovery of previously unknown knowledge. We're not going to get into that in this course, but it's one of the ways that people differentiate between machine learning and data mining.
Another hot area is deep learning. Deep learning is a sub-discipline of machine learning, and basically what deep learning uses is artificial neural networks, or ANNs, and in particular very deep, multi-layered artificial neural networks called deep neural nets. Essentially, deep neural nets have many hidden layers. They mimic the structure of the human brain more than traditional ANNs do, and they tend to learn much more complex patterns and to handle tougher problems. They've also changed the architecture of artificial neural nets to include things like recurrent neural nets, convolutional neural nets, and deep belief nets. These are constantly evolving, and I think the area of deep learning is where a lot of the excitement is and where some of the most significant advances have occurred. Interestingly, two of the most important players in the field of deep learning are actually Canadian: Geoff Hinton is at the University of Toronto and Yoshua Bengio is at the University of Montreal. Geoff is a little older than Yoshua; Yoshua learned his machine learning, I guess, from Geoff Hinton. Both won the equivalent of the Nobel Prize in computing, the Turing Award, in 2018, and they're now among the most cited scientists in the world. Now, there are three different approaches to machine learning: supervised learning, unsupervised learning, and reinforcement learning. They might seem subtly different. Supervised learning is the most common, and that's sort of the example I was giving, where you're providing both inputs and outputs, and the outputs are the gold standard truth. The point of supervised learning is to learn the rules that map inputs to outputs: learning the model of addition, learning to recognize the number two. Unsupervised learning is essentially giving the computer unlabeled data.
So you just give it a whole stack of different numbers and hope the computer can start recognizing, first, that it's supposed to recognize these as numbers, as written numbers. It's trying to find structure in the input data. That's much more challenging, but it's also something that people do quite well, and it's one of the areas of machine learning where there's active development. Reinforcement learning is a variation of supervised learning. It's trying to solve a problem, but it's rarely given the outputs; instead, it's given feedback to maximize rewards. If you want, it's like giving praise: close but not right, getting closer, getting closer. That's what reinforcement learning essentially does. So it's similar to supervised learning, but slightly different. Machine learning uses what we call models, and I'll use this term a lot; it's a formal term in machine learning. Models are things like artificial neural networks. These are the tools that create or write the program: they take the input and the output, and the model generates the program. The simplest models are decision trees, or collections of decision trees called random forests. The next most complicated are the artificial neural nets. Probably the most complicated are hidden Markov models. Relatively simple ones are genetic algorithms. And support vector machines and Bayesian networks are other examples of models that can be used to learn different properties.
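A few of the model families just listed can be tried side by side in a few lines. This is a sketch under assumptions: it uses scikit-learn's iris dataset as a toy problem, which isn't part of the course material, just a convenient built-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A single decision tree, a collection of trees (random forest),
# and a support vector machine, all trained on the same labeled data.
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0),
              SVC()):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{type(model).__name__}: {score:.2f}")
```

The key point from the lecture survives the swap of dataset: each model family is a different machine for turning (input, output) examples into a reusable predictor, and they can be compared on the same data.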
As for applications, whether it's supervised, unsupervised, or reinforcement learning, you can see them in things like self-driving cars; playing different games, whether it's poker or now more advanced levels of chess; face recognition; fingerprint recognition; stock trading; spam filtering; gesture recognition, which you might see on things like PlayStations; speech recognition; handwriting recognition; analyzing radar images; medical diagnostics; and then, of course, why most of you are here: applications in bioinformatics. Examples of everyday machine learning include voice recognition, whether it's on your cell phone or on your at-home device, like Alexa or Google Home. What happens is that when you speak, the analog sound is converted to a digital form, and that digital set of sound waves is then run through various pattern recognition steps and processed into a set of parsed words. And then Alexa or Siri asks what it can help you with, or gives you your answer. Typically, the best speech recognition tools use either hidden Markov models or deep neural nets. They parse or segregate the speech, they get rid of the noise, they get rid of the ums and ahs, the hesitations that are typically found in human speech, and they parse out the words, and also the meaning of those words in the proper sequence. If you've ever had your credit card number compromised or stolen, machine learning is one of the ways they're able to detect that. Some of you may even have got calls when you've been traveling and they think you're doing something unusual. What's happening in credit card fraud detection and prevention is that they're looking for unusual purchases at unusual locations. They already have a good deal of information about your habits, what you like to buy and where you like to buy it. They have collections of what you have bought, so they can look at historical data and build out pattern recognition.
They'll also use information about you to create a sort of profile, so each fraud prevention tool is actually somewhat personalized for you. It's sort of like precision health, but precision health for your credit card. So this is an example of everyday machine learning, and it's actually quite effective. About 10 or 12 years ago, there was a really interesting challenge from Netflix. I'm sure many of you have been watching TV or videos through Netflix because of COVID. Netflix has gotten so popular because it's pretty good at recommending shows that align with your interests or general viewing habits. What Netflix did 12 years ago was offer a million-dollar prize contest to see if anyone could come up with a program that could improve recommendations for users. The result was that several groups developed very good ones, but one was significantly better, and it was better than the algorithm Netflix was using, and of course since then they've continued to improve it. What they did was use all of the data they had: millions of subscribers, millions of ratings, millions of searches. Huge amounts of data that could then be mined, patterns could be detected, and the result was essentially a recommender that was very accurate and is still continually being improved upon. If you have a cell phone with a good camera, or a conventional camera, at least a digital one, most of you have tools to do autofocus. The autofocus element typically recognizes faces: it recognizes eyes and nose and mouth and the elements of shadow, and with that it can focus on those faces and not on something else. Face detection is used, obviously, for things like security passes with computers or in secure buildings, but it's also used in your camera, and again it uses machine learning elements to extract the features.
It sees many thousands or millions of faces, looks at those features, and identifies them when the head is in different positions or with people having different skin colors or skin tones. Machine learning is now used for the development of autonomous, self-driving vehicles. It's still a hot area of research; typically, reinforcement learning is key for the development of autonomous vehicles. Other things are required too: path planning, knowing where to go so you don't go off the road, looking ahead, and handling confusing roads. The lower right corner is an example of a pretty beat-up road that's not well marked. This is not challenging for a human, but very challenging for computers. Now, those are examples of machine learning in everyday life, if you want. There are lots of applications of machine learning in bioinformatics. We're going to give you some examples of machine learning being used in things like secondary structure prediction for proteins, gene finding, and elements of sequence motif finding. It's also been widely used in SNP and GWAS analysis, disease diagnosis, DNA sequencing, spectral analysis, and in cheminformatics, drug design, and drug discovery. Anything where you can get large amounts of data and where the problem is really hard to calculate, where conventional programs fail. Now, I've been involved in machine learning for a long, long time. I got involved back in 2004, and we wrote a review in 2006 called Machine Learning in Cancer Prediction and Prognosis. I guess it turned out that was about five or six years ahead of its time, so no one noticed it until about seven or eight years ago, and now it's being widely used. We've applied machine learning in the areas of protein secondary structure prediction and in calculating mass spectra. Interestingly, one of the very first applications, in the 1960s, was a program called Dendral, used to calculate mass spectra.
Fifty years later, we revisited that, and in fact, with new computers and new approaches to machine learning, we were actually able to essentially solve that problem. We've also applied machine learning to genome-wide association studies. We were really interested in understanding a little bit about single nucleotide polymorphism panels and calculating disease risk prediction from SNPs. Unfortunately, a lot of the public SNP data doesn't have the information needed for calculating risk profiles, so we had to figure out a way of essentially regressing the data and using both support vector machines and random forests to calculate multi-SNP risk scores, or risk prediction curves, from GWAS data. So again, an application where you can use machine learning to do what at the time seemed to be impossible. Machine learning has also been used in many other areas. This is an example of being able to do SNP typing, looking at hundreds to thousands of SNP variants with hundreds of categorical phenotypes. This used unsupervised learning, something called swarm clustering. It's a neat idea, and I guess an example of how you can skin a cat in many different ways. Some of you have probably used the MinION, the Oxford Nanopore sequencer. This is a widely used technique now for sequencing bacterial genomes; it's truly sequencing on a chip, and you can plug it into your laptop. On the lower right is an example of the nanopore output. When you're sequencing DNA with the nanopore system, DNA is pumped through a protein motor, sort of one base at a time. It's configured into a membrane, an electrical output is read, and you essentially detect changes in both the speed and the electrical signal that comes out as the DNA is pumped through this nanopore. And what you can see is that there are interesting patterns, but those are supposed to represent A's, T's, G's and C's.
It turned out it was too hard to come up with a rational program to convert those outputs into base calls. What had to be done instead was to use hidden Markov models and recurrent neural nets, trained on thousands and thousands of examples, or exemplar data, to actually get the sequencer to work. And as they continue to train, the performance of these MinION sequencers actually gets better month to month and year to year. There are also examples of deep neural nets for predicting things like the sequence specificities of DNA- and RNA-binding proteins. Historically this was done more by conventional computing, looking at position-specific scoring matrices and things like that. But with the advent of machine learning, where you've got large collections of DNA- and RNA-binding proteins and the sequences they bind to, they've extended the motif scanning to include neural nets, which are now a little smarter, and by iterating and training, the DeepBind models become much more accurate than traditional sequence motif techniques. DeepBioSeq is another tool that uses deep learning, again convolutional neural nets, to analyze RNA-seq data. It doesn't require sequence pre-processing and doesn't require a genomic alignment; it can work directly with FASTQ files. And they've been able to adapt it to single-cell sequencing and ChIP sequencing. Machine learning is also used in CRISPR target design. There's a company called Desktop Genetics that has been able to use machine learning, training on large collections of experimental data about things that work and don't work with CRISPR, and machine learning is able to essentially design better, smarter, more useful CRISPR target sequences for gene modification. Machine learning has also been extended beyond the molecular range of looking at sequences or RNA-seq or binding, into the area of health, and particularly cancer.
And particularly with the integration of not only the genetic tests we do with whole genome sequencing of cancer cells, but also imaging, in this case mammogram results, and genealogical data, looking at patient history. By linking electronic health records and genealogical data with genomics data, they're able to come up with much more robust risk predictions for a person initially being diagnosed through a mammogram, about whether they may have cancer or not. You've maybe had your genome analyzed by companies like 23andMe or Ancestry.com. 23andMe, which was spun off from Google, used data from about 600,000 people: not only GWAS data but information on people's body weight and lifestyle, to essentially come up with a way of predicting a person's weight. They've now got essentially a way of telling people whether they will gain weight, whether they're at higher risk of gaining weight later on, and how to manage either their propensity for gaining weight or for being underweight. There are also applications in tumor genomics. Again, this is something that's doing a deeper dive, especially now that we can do large-scale next-generation sequencing and look at copy number variants and single nucleotide variants. This work uses random forests and looks at features ranging from strand bias to batch effect, and the use of machine learning in this approach to analyzing large-scale tumor genetics improved performance significantly over what humans could do. These applications are not exclusive to genetics or protein sequence analysis. We often use chemistry to analyze blood spots for newborn screening. If you're under the age of 30, you've probably had one of these tests, though you didn't know it, because it happened a few hours after you were born.
The way it works is that blood is collected and sent into a mass spec to look for certain patterns: a higher abundance of phenylalanine to indicate phenylketonuria, a higher abundance of methionine, and so on. Now, these signals are kind of noisy, and using machine learning to recognize the signals and measure their abundance actually improves the overall performance for recognizing whether it's PKU or hypermethioninemia. There's a tendency for humans to over-predict, or at least predict too many, so there are a number of false positives, and machine learning reduced those by a significant margin. So that's an introduction, if you want, to machine learning: what it is, how we define it, its history, how it's used in everyday examples, and how it's used in bioinformatics. Now, this is maybe the most important slide of the day, and maybe of the whole course: it outlines the machine learning workflow, something you'll need to remember and need to understand. There are six steps. The first step is to define the problem and to come up with a suggested solution. We'll explain how you choose a good machine learning problem and how you can propose a decent solution. The second step is to construct your data. Machine learning needs both input and output, so you generally have to have a large data set. That data set is usually pretty messy, and in the third step you spend a fair bit of time transforming your data set. We'll explain that. That may mean formatting it in a way the computer can read, formatting it in a way the machine learning tools can read, but also normalizing or scaling it. There's also an element called feature selection, which is targeting certain types of data and discarding irrelevant data. So the first three steps are really data dependent, and this is often where people tend to neglect what's needed for machine learning. The next part is choosing and training a model.
This is what everyone wants to do: do you choose an artificial neural net, do you choose an SVM, do you do something in deep learning? Once you've chosen your model, you train it. Training is providing examples of input and output. But training is only so useful; eventually the computer has to give a real performance. The training is like the dress rehearsal; eventually you have to do the live play. So in orange, that's where you test your model and check its performance. By checking both how it performed during the training and how well it did during the live performance, you can assess whether it's robust and sufficiently trained, or whether it needs to go back to school and retrain. If it does really well on the test, then it passes and it's ready to be used on many other examples. And so the last step is, once it's passed the testing and training phase, it's graduated and it's now ready to make predictions, or make decisions, or model, or do regression, or whatever you decided it should do. So those are the six steps; training and testing are something I will talk about over and over again. When it comes to choosing a problem, step one, with machine learning you're generally trying to do something that's an unsolved problem, something that's interesting to you, to your supervisor, to your own research. Typically, you should choose a machine learning problem that can't be solved mathematically; don't try to use machine learning to come up with a better tool for addition or subtraction, because that's already been solved. Choose something that is very difficult, something that requires special knowledge. It might be that over the years you've learned to do something really, really well, but you can't seem to teach it to anyone else. Then what you might want to do is develop a machine learning tool to do what you do, so that it does it either better, or more efficiently, or faster.
Typically, for machine learning, you want a problem that's focused on finding a pattern, or classifying things, or in some cases doing what we call regression or curve fitting. The examples I gave you from the bioinformatics side are all cases where people are trying to find patterns, find better ways of classifying things, or improve regression calculations. Another thing is that in order to have a workable machine learning problem, you need a lot of data. You need training data, or exemplar data, where you have both the input and the output. So someone has partially solved the problem, or there are cases where the answer is known, or maybe some instrument has been able to measure the answer for you. Having enough training and testing data is absolutely critical. So these are the four major constraints in choosing a problem, and not every problem will meet all four. If you can't meet all four, then machine learning isn't the thing to use. So, once you've chosen a problem, once you've got an idea of how you think you can solve it, and you've presumably got some training and testing data where you have the answer, both the input and the output, then you construct your data set. The data obviously should be reliable; you have to have a gold standard answer as part of your output. Typically, the data has to be labeled. That's for supervised learning; in unsupervised learning we can get away with unlabeled data, but our focus for today and tomorrow is supervised learning. The data can be categorical, so red, yellow, green; nominal, which could also be red, yellow, green, but as named data; or numerical, so sets of numbers. And as I said, the labels have to be gold standard answers. They have to be correct; we have to have some kind of instrumental confirmation that this is what's really there.
Ideally, what you also want is relevant information that probably contributes to the phenomenon. If you're trying to predict DNA binding motifs, information on the phase of the moon, or the astrological sign of the person who collected the data, is probably irrelevant. This is where you need insight and knowledge about the most important or likely contributing features to, say, DNA binding. And this is, again, somewhere people have a tendency to say, you know, I've got a pile of data, it fills many drawers in my filing cabinet, or I've got gigabytes of data on my computer; here, take it and tell me the answer. It's sort of like someone asking, what's the meaning of life? The data has to be structured, it has to have useful answers, and it has to be relevant to the question being asked. So when you construct your data set, you have to have a training data set, that's inputs and outputs, where the outputs are gold standard, and a testing data set, which is also a set of inputs and outputs. Sometimes the test set is a portion of the training data set that you hide away. And then many people, especially physicians in the medical community, also want something called validation data, which is essentially a third data set used to show that your training and your testing were really sound. Now, a question I've had to field for many years is: what's the amount of data you need for machine learning? There is no single right answer. It depends on the type of problem, the quality of the data, how noisy the data are, and how noisy the answer is. Typically, you need about 1,000 examples for an average problem in machine learning; better cases usually have 10,000 to 100,000 examples. If you're trying to do something really difficult, like text translation, or anything that requires deep learning, you need more.
You have to go up by almost another order of magnitude, to 100,000 to a million examples. Thanks to things like next generation sequencing, and to the rapid developments in protein structure determination and the large numbers of known protein sequences, some of these much harder machine learning problems are now solvable, because we are in the range of 100,000 to a million examples. So I've gone through the first two steps, defining your problem and constructing a data set; the next one is to transform your data set and select your features. When you have big data, usually it's pretty messy. Sometimes there are repeats. Sometimes there are missing values, so those missing values have to be imputed, or filled in. Sometimes things have to be reformatted. Sometimes you have to identify outliers, and sometimes you have to remove sparse classes or group them together into groups that are more meaningful. This is called data cleaning or data cleansing. Another thing that's often done is converting categorical or nominal data, things with names, into numeric data, because computers read numbers, they don't read names. One approach is called one hot encoding, which we'll hear about with neural networks; there are other ways of reformatting data too. Another important thing that many people don't realize is that a lot of data has to be normalized, that means making it Gaussian. A lot of biological data is very skewed, and taking a log transformation, or other transformations like range scaling, enormously improves the performance of a machine learning program. That's called data transformation. Occasionally, you look at your data set and say, well, I think we're missing some things here that I'd really like to have. So in some cases people add features to their data set. Sometimes they calculate ratios. Sometimes, if it's sequential data, they include the day of the week.
Because there are tendencies that happen during the days of the week, like on weekends people don't work, so if you're trying to measure or predict mobility information, knowing the day of the week actually helps a lot. Adding obvious relationships or intuitive expectations to your data set can sometimes make a real difference. Another part is feature selection, or feature engineering. This is where you either select features or remove irrelevant data. Sometimes this is done by the machine learning program; sometimes it's done through intuition. So, I mentioned one hot encoding. This is widely used in machine learning. It's basically converting categorical or named data, like red, green, blue, into a binary representation. You can see a table where the left side makes sense to us, because we know colors. But we can convert it to a different table where we still have the identification of the objects, but now the labels are put up in the column headers, and we've just indicated with a one or a zero whether the object is red, blue, or green. That makes it much more machine readable, and that's called one hot encoding. As for fixing skewed data, I'm showing some skewed data on the left. It's not uncommon to see that in image analysis; microarray data is one example. Audio data, sound data, often has very skewed distributions too, but by taking a log of that skewed data, you can come up with a nicely normalized distribution, and normally distributed data is much more easily classified and processed. Feature selection, we talked about this as well. As I said, not all data is relevant, and reducing the amount of data used in training can greatly speed up the algorithms. Some machine learning programs can take literally weeks to finish training, and if you have too much data, it can take literally years.
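To make the one hot encoding and log transformation concrete, here's a minimal sketch in Python with pandas and NumPy. The column names and values are invented for illustration, not from the course data sets:

```python
# Sketch of one hot encoding and log-transforming skewed data.
# Column names and values here are made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({"object": ["apple", "sky", "leaf"],
                   "color": ["red", "blue", "green"],
                   "intensity": [1.0, 10.0, 100.0]})

# One hot encode the categorical column: each color becomes a 0/1 column.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
# → ['object', 'intensity', 'color_blue', 'color_green', 'color_red']

# Log-transform the skewed numeric column so it is closer to Gaussian.
df["log_intensity"] = np.log10(df["intensity"])
print(df["log_intensity"].tolist())  # → [0.0, 1.0, 2.0]
```

`pd.get_dummies` is one convenient way to do this; scikit-learn's `OneHotEncoder` does the same job inside a modeling pipeline.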
Feature selection is a way of getting rid of the features that contribute the least and keeping those that contribute the most. If you include irrelevant data, or just drag that irrelevant data along, it can actually reduce the performance of the machine learning model. The model that gets away with the minimum complexity is often the best, and often the most accurate. Visually, you can think about feature selection like this: you've got a whole bunch of features, in this case seven, in different colors, and we choose to get rid of some of them, so we go from seven down to three. That, as I said, can be done automatically through a variety of programs, or manually, where someone simply uses their intuition or compares performance with different combinations of features. There are also plenty of examples where people use too few features or too little data. If you had the data on the right but had somehow neglected to include seven or eight points, you would get the data on the left. And of course, if you see two points, the tendency is to draw a straight line, which is not always correct. So if you have too little data, you can come up with the wrong interpretation. These parts, defining the machine learning problem, constructing your data set, transforming it and selecting features, are actually the most important parts of machine learning. They're typically the most neglected, and they're the reason for most failures in machine learning. The next parts are methodological, and we're probably going to spend a fair bit of time on the different models. But in the end, a lot of machine learning models are pretty much equivalent. They do about the same; some do slightly better, some do slightly worse.
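Going back to feature selection for a moment, the seven-features-down-to-three idea can be sketched with scikit-learn. The data here is synthetic; only three of the seven columns actually drive the labels:

```python
# Minimal feature selection sketch: keep the k features most associated
# with the labels and drop the rest. Data is synthetic, for illustration.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))                        # seven candidate features
y = (X[:, 0] + X[:, 3] + X[:, 5] > 0).astype(int)    # only three features matter

selector = SelectKBest(f_classif, k=3).fit(X, y)
X_reduced = selector.transform(X)                    # shrinks 100 x 7 to 100 x 3
print(X_reduced.shape)  # → (100, 3)
```

`SelectKBest` with an ANOVA F-test is just one automatic approach; as noted above, manual selection guided by intuition and performance comparisons is equally common.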
You're never going to have a case where performance is at 2% with one model, and you change to a new model and now it's 99% correct. You're going to get something that predicts maybe 75% with one model and maybe 79% with another. They're subtly different, and yes, 79 is probably statistically better. But if you haven't constructed your data set correctly, if you haven't transformed it, if you haven't selected your features, those are the things that make the difference between a 2% performance and a 99% performance. So it's the back end stuff that's really important. However, it's the front end that we'll focus on, because that's where most people are more curious. In terms of choosing your model, we've talked about these already: decision trees, random forests, neural nets, hidden Markov models, support vector machines. You don't know in advance which model will be best, and many people in machine learning will try multiple models. As I say, usually you'll just see subtle differences between the models; sometimes one is significantly worse because it was really designed for something else. Knowing a little bit about what different models are capable of, or most suited for, certainly helps you avoid wasting time on a model that's not really suited to the problem. But these days, with machine learning and deep learning, there are typically half a dozen models that'll give you almost equivalent performance. So I mentioned a few of them, and I'll just briefly go through them; we'll talk about them in much more detail later. The decision tree is actually the simplest machine learning algorithm to understand and to implement. On the right I'm showing an example of the survival of passengers on the Titanic. The Titanic was a giant ship that went down about 100 years ago, and the rule of women and children first was followed on the Titanic. Most women survived the Titanic.
But they were also trying to choose young people. So if you were male and you were sufficiently young, under the age of, it turns out, nine and a half, then you had a better chance of survival, and you had an even better chance if the number of people in your family was sufficiently large. That's the decision tree that was identified, or learned. Some of the numbers and the less-than and greater-than signs are messed up on this slide, so we'll correct that, but anyway, what happens in a machine-learned decision tree is that you have the passenger list of all the people who survived and who didn't, along with their age, their gender, and their number of family members. The computer then learns to recognize which decisions were effectively made by the captain or the crew in deciding who lived and who died, and this decision tree is what was learned. The captain didn't have this table to decide who would go onto a lifeboat and who wouldn't; this is what was learned afterwards. The decision about males versus females is probably obvious, but what was the cutoff age: was it nine, was it 10, was it 11? Well, it turned out to be nine and a half. How many siblings and spouses: was it three, four, or five? This is what the machine learning decision tree learned. It consists of what looks like an upside down tree, with the root at the top. It has branches, the squares or rectangles are called leaves, and the lines are called edges. And of course, what you can do is combine many decision trees together; having a combination of decision trees, where you either average or take a majority vote from the different trees, gives you a final result. The random forest is the step up, where instead of one decision tree you have multiple decision trees that the computer has learned, combined together; because you're combining them, it's called an ensemble method for learning.
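A Titanic-style decision tree can be sketched in a few lines with scikit-learn. The passenger rows below are invented toy data, not the real Titanic list, but the learned rule has the same shape: females survive, and young males survive:

```python
# Toy decision tree in the spirit of the Titanic example.
# The rows below are invented, not real passenger data.
from sklearn.tree import DecisionTreeClassifier

# columns: [is_male, age, n_family]; label: 1 = survived, 0 = died
X = [[0, 30, 1], [0, 50, 0], [1, 8, 4], [1, 40, 0], [1, 9, 5], [1, 35, 2]]
y = [1, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# An adult woman and a 45-year-old man: the tree reproduces the learned rule.
print(tree.predict([[0, 25, 0], [1, 45, 1]]))  # → [1 0]
```

With `export_text` or `plot_tree` from `sklearn.tree`, you can print the learned splits and see the upside down tree with its root, branches, and leaves.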
You can use it for classification, but you can also use it for things like curve fitting or regression. Another concept with random forests is that decision or prediction by committee is usually more accurate than a single tree, and that's true; it's sometimes called meta-prediction. Artificial neural nets are another area we're going to focus on, and these are really the precursor to deep neural nets and deep learning. They're called neural nets because they try to simulate the activity of the brain, and the brain is a pretty good engine for pattern recognition. With an artificial neural net we typically have nodes, those are the circles, which represent neurons, and then the lines, which are connections or weight matrices, represent the axons extending from the neurons. Like decision trees and random forests, artificial neural nets can be used for classifying, but they can also be used for regression, linear or nonlinear. Hidden Markov models, or HMMs, are called probabilistic graphical models, and they're designed to model sequences or connections of events, if you want. They are Markovian, meaning the process moves through a series of states, and we're essentially trying to model states that can't be observed, which is why they're called hidden Markov models. They're probabilistic: you have what are called emission and transition probabilities, and those probabilities, which you can think of as weights, not unlike in a neural net, have to be optimized. The optimization is done through dynamic programming, which, as some of you might know, is commonly used in sequence alignment. Hidden Markov models are somewhat losing popularity, because you can now get the same performance or better with things like long short-term memory networks, or LSTMs. But the hidden Markov model is important for historical reasons, and it's useful for predicting time events, things that occur over time.
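Returning to random forests for a moment, the prediction-by-committee idea looks like this in scikit-learn. The two well-separated Gaussian clusters are synthetic, just to show the API shape:

```python
# Random forest sketch: an ensemble of decision trees voting on a class.
# The two-class data is synthetic, generated from two Gaussian clusters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)),    # class 0 cluster around 0
               rng.normal(3, 1, (50, 4))])   # class 1 cluster around 3
y = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each of the 100 trees votes; the majority class wins.
print(forest.predict([[0, 0, 0, 0], [3, 3, 3, 3]]))  # → [0 1]
```

Swapping `RandomForestClassifier` for `RandomForestRegressor` gives the curve-fitting version mentioned above, where the trees' outputs are averaged instead of voted on.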
When you speak, you produce sound that varies over time. If you're looking at the stock market going up and down, that's the value of a stock over time. If you're looking at DNA or protein data, those are collections of letters that are essentially sequential, maybe not in time, but in order. So sequential data, or temporal data, is well handled by hidden Markov models. Support vector machines: I think that's got to be one of the dumbest names. They are not machines, they are algorithms. Essentially it's a way of doing linear discriminant analysis, and it's also a form of regression. The trick with support vector machines is to use kernel transformations, or kernel tricks: you transform the data to find an optimal boundary, a hyperplane, between classes. It's essentially a form of partial least squares discriminant analysis, if any of you know multivariate statistics; it's bending or transforming the data to find an optimal way of separating it. We aren't going to talk about SVMs in this workshop, just because there's not enough time. So, once you've chosen any one of these models, for classifying, for regressing, for predicting, you initially have to train it; we've already talked about the amount of data you typically need to train it with. Then you have to test or validate the model. When you're doing machine learning, or classification of any kind, you have to ensure that your model is not over-trained. You don't want to underfit by using too few parameters, and you don't want to overfit by using too many parameters. At the bottom I'm showing examples where you've got some data you want to regress, the orange dots we're trying to fit. If we underfit, we would just draw a straight line and say, well, that kind of works. If we overfit, we'd essentially try to connect every single dot to the next.
And so we get the squiggly line. On the right it's overfitting; on the left it's underfitting; in the middle, the slightly parabolic curve crosses nicely through those points. If you underfit or overfit, the model isn't good for predicting new data, and you tend to underestimate the error. If the model is too elaborate, it essentially starts modeling noise, which also makes it particularly sensitive to the input data set. So testing and validation essentially allow you to calibrate and make sure that you're not overfitting or underfitting. That's a bit of an iterative process, and people sometimes have to spend a few days, sometimes even a few weeks, just making sure they haven't over-trained or under-trained. Over-training and under-training are very much a function of your data set. The most common mistake people make with machine learning, or with classification tools and multivariate statistics, is overfitting: using too many parameters. Underfitting is rare; overfitting is common. Many people report spectacular results because they didn't really test or validate their data: they trained on their training set and they tested on their training set. That's like taking a test, getting the answer key, then taking the test again and saying, aren't I smart, I know this stuff. It's not really a measure of your knowledge, especially if someone's given you the answers for the test. To prevent this over-training, you need to use external validation sets, n-fold or k-fold cross validation, leave-one-out, or permutation, or all four methods. One way of making sure you don't fall into the trap of overfitting is to take your data and split it into two groups. Now, imagine you're an experimentalist and you've spent half your life collecting this data set.
Someone might tell you to spend the other half of your life collecting a second data set, and that could be your holdout or validation set. More practically, you take the data you've already collected and split it. The split typically is two thirds / one third: two thirds of your data is used to train, to get the model up to speed, and one third, which is called the holdout data set, is used to assess the model. That test data can't be used in the training set; it has to be invisible, it has to be kept away. It's only at the end of the training that you're allowed to do the testing. Sometimes this one third / two thirds approach is not ideal; it may be a function of the data size, or the way things break out in terms of being unevenly distributed. Instead, you can do something called k-fold or n-fold cross validation. Maybe instead of splitting into two, you split into three or four or 10. This slide shows a threefold cross validation, where we have three rounds of training and testing. In this case we train on the two thirds in blue and test on the one third on the right; then we scramble it, and we train on the rest and keep the part on the left as our test set; then we keep the middle, in red, as our test set. This way we can train and test a little more rigorously and ensure that our model has not been overfitted. Ideally, we should get a similar performance on our test and training sets in all three rounds. If not, the model needs to be tweaked again through another round of this n-fold training process. We can go even further: we take our data set, train on everything except for one example, and then test on that one. Then we shift over: if the first example was our test case, then the second one becomes our test case, and then the third one.
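Before going on, the two thirds / one third split and the threefold cross validation just described can be sketched with scikit-learn. The data and the choice of logistic regression as the classifier are just for illustration:

```python
# Sketch of the two-thirds/one-third holdout split and 3-fold cross
# validation described above, on synthetic linearly separable data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Hold out one third for testing; train only on the remaining two thirds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")

# 3-fold cross validation: three train/test rounds, each third held out once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
print("per-fold accuracy:", np.round(scores, 2))
```

If the three fold scores differ wildly from each other or from the training accuracy, that's the signal, mentioned above, that the model needs to be tweaked.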
And if you've got, you know, 10 million examples, then you're going to have essentially 10 million iterations of this leave-one-out validation. We don't use it very much anymore, but it is a method that has historically been used. Permutation testing is another approach. You take your data, which has the gold standard answers, so it's been labeled, some in red, some in blue. You run your labeled data through your machine learning tool, and hopefully it separates or classifies it robustly, as shown in the upper right corner. Then you randomly permute the data: you relabel things, so if something was labeled green, now you label it blue, and if it was blue, now you label it green; you randomly mess it up, basically. You take your messed-up data, which is down at the bottom, called the permuted data, and run it through your machine learner, which has already learned the patterns. The scrambled labels should break the classification; the learner shouldn't be able to classify the permuted data. You can repeatedly permute your data, repeatedly randomize it, and repeatedly run it through your machine learner. If the machine learner still can't separate the permuted data, that's actually a good sign, because it shows that the machine learner has learned to separate the real data; it has learned the pattern, it hasn't just found noise, and it's not overfitting. You can calculate your separation score and plot it, as shown on the left, where the arrow highlights the distribution of separation scores. And if your real score is well separated from the pack, then you can say that through permutation testing you know your machine-learned model is very robust, and you can actually get a statistical value in terms of significance. Another way of assessing machine learning models is to use a confusion matrix; we're going to use this a lot.
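The permutation testing idea just described can be sketched like this. The data is synthetic, and the 50-permutation loop is kept small for speed (real analyses often use hundreds or thousands of permutations):

```python
# Permutation test sketch: shuffle the labels many times and check that the
# classifier scores much worse on scrambled labels than on the real ones.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(2, 1, (60, 3))])
y = np.array([0] * 60 + [1] * 60)

real_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()

permuted_scores = []
for _ in range(50):
    y_perm = rng.permutation(y)          # randomly relabel the samples
    permuted_scores.append(
        cross_val_score(LogisticRegression(max_iter=1000), X, y_perm, cv=3).mean())

# Empirical p-value: fraction of permutations that matched the real score.
p = (np.sum(np.array(permuted_scores) >= real_score) + 1) / (50 + 1)
print(f"real: {real_score:.2f}  permuted mean: {np.mean(permuted_scores):.2f}  p ~ {p:.3f}")
```

scikit-learn also packages this whole procedure as `sklearn.model_selection.permutation_test_score`.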
This is assessing whether a prediction was correct or incorrect: a true positive, a true negative, a false positive, or a false negative. Typically we're comparing between what's observed and what's predicted, which is a common thing we do in machine learning. This two by two table identifies the possibilities: something you predict as positive that is actually positive is a true positive; something you predict as negative that is actually negative is a true negative; and when the prediction and the observation disagree, you get either a false positive or a false negative. Combinations of these four, true positive, true negative, false positive, false negative, give you sensitivity and specificity: sensitivity is the true positive rate, specificity is the true negative rate. Typically in living, biological systems, measurements follow a distribution. Here, blue is the diseased group and orange is the healthy group. There's a distribution in their phenotypes, or their body temperature, or whatever it is you're measuring, and those distributions will overlap; they're not typically so distinct that they don't overlap. It's the overlap that produces the false positives and the false negatives; the non-overlapping regions give the true positives and the true negatives. At the bottom I've given you the definitions of sensitivity (SN) and specificity (SP). When comparing sensitivity and specificity, the true positive and false positive rates, we often calculate what's called a receiver operating characteristic curve, or ROC curve. It's been around since the 40s; it was introduced when they were doing radar analysis of German bombers attacking the UK, and they were assessing the performance of the radar stations. But ROC curves have since moved on, and they're widely used in biomedical applications to assess the performance of classifiers and to look at biomarker models.
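Staying with the confusion matrix for a moment, here's a minimal sketch with made-up observed and predicted labels, computing sensitivity and specificity from the four cells:

```python
# Confusion matrix sketch: compare observed vs. predicted labels and compute
# sensitivity (true positive rate) and specificity (true negative rate).
# The label vectors are invented for illustration.
from sklearn.metrics import confusion_matrix

observed  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(observed, predicted).ravel()
sensitivity = tp / (tp + fn)    # SN = TP / (TP + FN)
specificity = tn / (tn + fp)    # SP = TN / (TN + FP)
print(tp, tn, fp, fn)           # → 3 5 1 1
print(round(sensitivity, 2), round(specificity, 2))  # → 0.75 0.83
```

Note the `.ravel()` order for scikit-learn's binary confusion matrix is TN, FP, FN, TP, which is easy to get backwards.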
A ROC curve plots the true positive rate against the false positive rate for some kind of binary classifier, predicted versus observed, as the cutoff point is varied. This is an example of a receiver operating characteristic curve, the brownish curve; on the x axis is one minus the specificity, and on the y axis is the sensitivity. And although you can't see it very well, and this normally would be animated, there are different cutoffs, as we move these orange, yellow, green, and blue lines across the distribution, to see where things are most optimal in terms of sensitivity and specificity. If we cut off here, we're way down in this range with a very low false positive rate but also a low true positive rate; up here is more optimal. So somewhere around here we get the best separation, or maybe here is the best overall performance. We can assess the quality of a predictor, separator, or classifier by looking at the area under the ROC curve, and that's how we sometimes measure biomarkers or the performance of tests. A random test or random biomarker would have an area under the curve of 50%; that's a coin flip, it's random. A perfect test that predicts whether someone, say, has cancer or not would have an area under the curve of 100%. Most biomarkers used in medicine have an area under the curve of about 70%, which is normally classified as a poor test. But some of the better ones being developed with multi-component analyses using machine learning have areas under the curve of 90% or better; those are very useful, very valuable tests. So, after you have constructed your data, transformed your data, chosen your model, trained your model, tested your model, and validated your model, it's now ready for its Broadway debut. You can now take it out and use it to make predictions or perform classifications.
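The area under the ROC curve discussed above can be computed in one call. The labels and marker values below are invented, chosen so the marker separates the groups well but not perfectly:

```python
# ROC sketch: score how well a continuous marker separates two groups by the
# area under the ROC curve (0.5 = coin flip, 1.0 = perfect separation).
# The labels and marker values are invented for illustration.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 0, 0, 1, 1, 1, 1]                    # 0 = healthy, 1 = diseased
marker = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]   # e.g. a blood measurement

auc = roc_auc_score(labels, marker)
print(round(auc, 3))  # → 0.938
```

An AUC of about 0.94 would put this hypothetical marker in the "very useful" range described above; `sklearn.metrics.roc_curve` returns the full curve if you want to pick an actual cutoff.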
It has to have passed all the training and testing phases, but once it's done that, you can have some good confidence, and you can publish it or use it or explain it however you want. This is an example of a model where we've predicted, in this case, mass spec data using a hidden Markov model that also incorporates artificial neural nets. This has been thoroughly trained, tested, and validated, and in the course of doing it, we put it up as a web server. That's another way of putting your model out there so that people can use it. I think we're wrapping up this phase of the course, this module. Over the next two days we'll introduce you to decision trees and artificial neural nets, which are our focus for today, and then hidden Markov models tomorrow. We're going to do three bioinformatics problems: one for general classification, which you can do for just about anything in biology; then something in secondary structure prediction; and then one for gene finding. Now, these are not really advanced, in the sense that these are techniques that were developed many years ago through machine learning, but this is a good way of getting you started. You can apply the same concepts to your own problems and your own interests. We're going to use Python, and we're going to use a tool called Google Colab. And we're going to do a deep dive into the algorithms and the code. This, I think, is unique to this course. The concept here is that it's sort of like helping you understand what the engine is, what the wheels are, and what the transmission is in your car. It helps to understand a little bit about your car before you drive it. If you just treat your car as a black box, you're going to have problems. So we want you to understand a little bit.
We don't want you to be able to pull your engine apart, put it back together, and become a mechanic, but we do want you to understand the concepts, why these things work and why they don't. After doing the deep dive into some of the messy code that's involved in machine learning, we're going to come up for air and show how you can do the same things using Keras and scikit-learn, which simplify a lot of the more difficult elements of coding neural nets, Markov models, and decision trees into simple function calls. So it's sort of, you know, you've graduated, now you know how your car works; now you just have to put the keys in and press go. That's what Keras and scikit-learn do, and they allow a lot of people who may not have the strongest mathematical or computing background to do some pretty impressive work in machine learning. That's what we're going to focus on tomorrow: introducing you to Keras and scikit-learn, after we've tortured you with looking at the actual code inside those modules. So that's the end of module one. I'm looking at our time here. How are we doing? We're good; we're a bit early on the schedule, but we started late, so we're about right on time. Okay. So now, maybe I'll stop sharing, and I just want to ask a question to the group. How many of you have completed the pre-reading material? If you have, just go to the reactions and put up a yes. Okay, just go to reactions, I just want to make sure. Now, it looks like a few of you haven't, and I think this is important. What we can do, and I'm going to go back to sharing the screen, is go through this separately, or you can go off with the TAs, for those of you who are still challenged. This is really to introduce you to Google Colab, so I'm just going to race through it; I'm not going to spend much time, but this is the environment we're going to be using.
In order to use Google Colab you have to have a Google account, and you need to have Google Drive. By working through Google Colab you don't have to download and install Python or R, which can be a problem on certain computers. This just shows you how to set up a Google account; I'm not going to go through it specifically, but these are the slides to follow if you haven't set up your Google account yet. Then through Google you can access Colab, or the Colaboratory. There's a little video there that you can watch. It runs similarly to what's called a Jupyter Notebook; some of you might have heard of Jupyter, and some of you might actually use it. It allows you to do online editing, viewing and inserting. If you've ever used Google Docs, it's sort of the same thing. It allows you to combine code with other content: you can write text, you can put in images, you can put in HTML. So these notebooks are actually pretty impressive. This just shows you how to go into your Google Drive; it also shows you how to download the folder onto your computer and how to unzip it. These are the folders. Again, I'm not going to spend a whole lot of time on this, so if you haven't done it, you need to do it during the break. Then there's how to install the Colab app and add it to your Google Suite, how to open a Colab file, the main page, and then opening your first Colab notebook. And this just shows you how to start: step one, step two with editing, changing the name of the notebook by clicking and putting your name in there, and then how to code, with the classic "hello world" example for Python. Once you've entered the code, you can run it by pressing the little arrow, which runs the cell and produces some output. We also show you how to upload data, since sometimes you need data sets to be able to run the programs we're showing you.
You have to click on the file folder, which holds the data set, then upload the file with the icon there to select the data files; you can also drag and drop if necessary. You can run all of the different cells: many of the code sets or snippets are broken into cells, but you can run all of them to run the complete program. So at this stage, as I say, I raced through this. Hopefully most of you will have gone through those slides before and will have had some success; if you haven't, we'll have time during the break or after this to make sure everyone's up to speed. Now I'm going to introduce, very briefly, NumPy (however you like to pronounce it). It's a Python library for handling arrays, matrices and tables, and it also has some higher-level mathematical functions. Those table and array functions (dot-product-type things) and higher-level math functions are really, really useful, and that's why we use NumPy in a number of our programs. Then pandas is also used for data manipulation, and it's particularly useful for handling data in comma-separated value (CSV), Excel or text format. It reads the data into a data frame, which has rows and columns, i.e. tables or matrices, and we use rows and columns a lot in machine learning and obviously in bioinformatics. I mentioned as well that you can do machine learning in R, and I know a number of you are more comfortable in R. So while in this course we're taking you through it in Python, the equivalent code is available in R, annotated in the same way, and those of you who are bilingual will see some useful similarities between Python and R. Colab supports R as well (there's also RStudio, which lets you do the same thing). As a rule, Python is somewhat faster than R in terms of runtime. And this is just how to start a Colab notebook in R. You'll also find the student repository pages; you can click on the hyperlink, or the one that's given here.
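Backing up to NumPy and pandas for a moment, here is a minimal sketch of the two roles just described: NumPy for array math (a dot product), pandas for a rows-and-columns data frame. The column names and values are invented for illustration; the workshop's real data files will differ:

```python
import numpy as np
import pandas as pd

# NumPy: arrays/matrices plus higher-level math such as the dot product.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([0.5, 0.5])
scores = X.dot(w)   # matrix-vector dot product

# pandas: a data frame with rows and columns, the same structure a
# CSV or Excel file would load into via pd.read_csv() / pd.read_excel().
df = pd.DataFrame({"gene": ["BRCA1", "TP53"],   # illustrative labels
                   "score": scores})
print(df)
```

Loading a real CSV with `pd.read_csv("yourfile.csv")` produces exactly this kind of data frame, which is why the two libraries appear together in so many of the course programs.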
As you scroll down, you'll see the labs, where we've got the Python code, the R code and the data collections, and by clicking on those links you can open up specific modules. For module one we don't have any coding, because it's just an introduction; for module two we will have code. And as you'll learn tomorrow, there are a lot of other tools that allow you to avoid some of the complications of the coding that we'll show you. Whether it's scikit-learn or Keras (these are the ones we'll talk about), or other machine learning programming tools such as TensorFlow, Azure ML, PyTorch and Torch, WEKA and MOA, these are all essentially tools that people are using to become pretty competent machine learning specialists without having to do the complicated coding, by just dragging and dropping or calling up specific functions. So we'll show you how they simplify things like decision trees, neural nets and hidden Markov models, but it's important to also understand how those models work. To wrap up: machine learning is a method of building programs or algorithms automatically. It's great for pattern finding, fitting and prediction, but it needs large data sets. There are different models, of which we'll talk about decision trees, neural nets and hidden Markov models; we won't get to all of them. I think we've shown you how it can be used in many everyday applications, and certainly through some of the tools like Keras and scikit-learn it's becoming much more accessible and much easier to use. I think I've given you some examples of machine learning in bioinformatics. Deep learning is an extension of machine learning; it's now being used more and more, and it's quite impressive in what it's able to do. Deep learning applications are starting to appear in the areas of genomics, bioinformatics and cheminformatics. In some cases it's really only limited by your ideas, or by the access to data that you have, to come up with some compelling machine learning applications.
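As a concrete instance of that simplification, here is a hedged sketch of a complete train-and-test cycle for a decision tree in scikit-learn. Again, the built-in iris data, tree depth and split fraction are my illustrative choices, not the course's own exercises:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Built-in iris data as a stand-in for a real bioinformatics data set.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for testing, as the training/testing
# phases discussed earlier require.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)            # training: one call
test_acc = clf.score(X_test, y_test) # testing: one call
print(f"Test accuracy: {test_acc:.2f}")
```

The entire tree-building algorithm that we will walk through by hand in the deep-dive modules collapses here into `fit()` and `score()`.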
Now, there's also something else to know about: in addition to machine learning, there's a field of statistics called multivariate statistics. Some of you might be familiar with things like principal component analysis, hierarchical clustering, partial least squares discriminant analysis, logistic regression or linear regression. All of these techniques are sort of precursors to modern machine learning, and in some cases they can do just as well. Principal component analysis does unsupervised clustering, or unsupervised learning; PLS-DA does supervised learning, or classification, and PLS-DA can do just as well as neural nets in some cases. Linear and logistic regression can perform just as well as neural net regression or SVM regression. So these are some caveats: if you've heard of or have used multivariate statistics, things like PCA or PLS-DA, then in essence you're kind of already doing machine learning, just in a more statistically based way. And it underlines the statistical basis of machine learning: the two are almost one and the same.
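To make that connection concrete, here is a sketch of the two multivariate ideas just mentioned, done in the same scikit-learn library as the machine learning methods: unsupervised PCA followed by a supervised logistic regression classifier. The data set and component count are illustrative assumptions (PLS-DA could be assembled similarly from `sklearn.cross_decomposition`):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Unsupervised: PCA projects the data onto its main axes of variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
var_explained = pca.explained_variance_ratio_.sum()

# Supervised: logistic regression classifies in the reduced 2-D space.
clf = LogisticRegression(max_iter=200).fit(X_2d, y)
train_acc = clf.score(X_2d, y)

print("Variance explained by 2 components:", var_explained)
print("Training accuracy:", train_acc)
```

Note that the API is identical in shape to the neural net and decision tree calls: the statistical precursors and the machine learning methods really do sit in the same toolbox.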