So welcome everyone. This is lecture one, module one, for Machine Learning in Bioinformatics. These are the standard slides we'll show at the beginning of each presentation, just indicating that we're using a Creative Commons license for the material. The schedule is something Nia has already highlighted; I'm going to show it each time we start a new module, just so people know roughly where we are and so I can keep track of the time, because some of these modules finish at odd times like 12:15, or start at 4:15 or 4:30, which can be a little confusing. You can follow along with the slides you've already downloaded.

About asking questions — we've covered this before: you can use Slack, you can use Zoom, and you can also interrupt through the audio. Likewise, if you need more attention, or if a question is long and elaborate or requires lots of TLC, we'll try to handle that through the breakout rooms.

I'm going to change things a little bit here. We've run this workshop before — this is the third time — and one thing I think it's important for people to understand right off the bat is that we need to temper our expectations about what you're going to be able to do. This is a two-day workshop; it's not a two-week or two-month or two-year program. Machine learning is not something you can learn in two days. The intention here is to give you an introduction, to maybe whet your appetite, but also to bring you into contact with people — particularly the TAs here — who can potentially help you with your machine learning problems. That's what's happened in the past: a number of interesting collaborations have developed, and people have learned a lot. Likewise, you'll interact with some of the people you've already been introduced to; some of you have similar objectives and interests, and working with fellow classmates can also be a way of working out some of your challenges.

So we're going to talk about machine learning. The first bit is a very gentle introduction: we'll show some examples of machine learning — some of which you're probably well aware of, others maybe not — and applications of machine learning in bioinformatics, genomics, proteomics and other areas. Then we'll talk about the standard machine learning workflow. This is where there's a lot of confusion: people sometimes think machine learning can solve all problems, or that it's the best choice for all problems. There are limitations. There are excellent other methods that are far faster and far better than machine learning in certain circumstances, and understanding your problem — and a little about the strengths and weaknesses of machine learning — really helps. We're also going to introduce you to Colab, which hopefully most or all of you have set up; if you haven't, we'll take a little time, and we're also considering using the break to help those of you who have had challenges with Colab or the code repository.

So, as I said, one thing to understand is that machine learning is very powerful and used in many fields — we'll see some examples. But fundamentally, machine learning is difficult.
It has lots of subdisciplines that you have to be familiar with. If you want to really understand it, you need at least second-year calculus, including differential and partial differential equations — typically you need advanced training in math. The coding is also challenging because it embodies that advanced math, and you need pretty advanced statistics to fully understand it. Now, I don't want to scare people, because many of you may not have that background. We're going to show you two approaches to machine learning. One is the hardcore math, which is how you had to do it back in the 80s and 90s when I was learning machine learning. A lot of that has been simplified now, so you don't necessarily have to master math and computing science and statistics to use it — but if you want to be an expert at it, you do. There's a difference there: lots of people can learn to drive, but not everyone can become a Formula One driver.

That said, two days won't make you an expert in machine learning. As I said, it's intended to whet your appetite, to help you understand some of the challenges, to look under the hood and appreciate some of the tough math — but also to show you how to get around those math challenges so that, with some good programming skills, you can probably use machine learning in your own work. Hopefully it'll inspire you to learn more on your own. That's really the intent of a lot of these CBW workshops: to get you started and, if you're curious enough, to open other doors along your journey.

Almost all the examples I'll give use Python, and we'll be using Google Colaboratory (Colab), which hopefully most of you have been able to log into or set up. We've tried to create code for each of the modules — Mark is still working on one of them. [Mark, were you able to finish the last bit on the artificial neural net for gene finding? — "I'm still working on the script; it'll probably be done by tomorrow."] Okay. For those of you who are fluent in R — who think in R and dream in R — we've tried to produce R code for this as well, but the preferred language for machine learning these days really is Python. The code will be provided to you, and Colab is a web-based system for programming, so you don't have to install Python or a Python environment on your own computer, which can sometimes be challenging.

So that's a bit of background: tempering your expectations, framing what you could or should expect to be able to do. Today we're going to show you how to climb Everest the tough way; tomorrow we'll show you how to take the helicopter up. So don't be intimidated — it's a way of really understanding what's under the hood with machine learning.

Since we're talking about machine learning, it's probably good to remember what learning is. Learning is something we all do — dogs learn, even nematodes can learn. It's a process, often through repetition, by which an organism or system improves its performance through experience. Machine learning is a subdiscipline, or branch, of artificial intelligence, or AI.
It's focused not on organisms but on computers improving their performance through experience — experience that you've given to the computer, or to your model. Machine learning essentially develops programs, which we call models, that can make predictions or decisions — classifying, partitioning, finding biomarkers — without being explicitly programmed to do so. That's an important caveat and an important distinction.

Machine learning is actually quite old — not much younger than the development of digital computers, which happened in the mid-1940s. Machine learning was already being done in the 1950s, and Arthur Samuel is considered the father of the field. He defined it as the field of study that gives computers the ability to learn without being explicitly programmed, and he worked on a checkers-playing program. Interestingly, a professor here at the University of Alberta extended Arthur Samuel's work and developed the world's best checkers program, and that largely launched machine learning activities here at the University of Alberta.

If you want to understand machine learning and distinguish it from traditional programming, I think this is a useful picture. In traditional programming, you have a set of inputs; the program reads those inputs, manipulates them, and spits out an output. For example: here's my list of genes and their association with a form of cancer, and the output might be a statistical assessment of each gene's propensity for causing cancer. So the program might just calculate numbers — add things up, divide by the total, something like that — a mathematical manipulation. In machine learning, you're not just giving an input: you're also giving the output to something that's not called a program but a learner, or a model. And what that learner or model spits out is a program — a predictor, a model, an answer. That's fundamentally different from general algorithms.

This is maybe better described with this picture. We have programs to do addition, multiplication, division, averaging, median calculation. At the top there's something that does addition: one plus one equals two. For learning, what we do instead is give it a lot of examples of addition: one plus zero is one, one plus one is two, one plus two is three. We're giving both the input (one plus zero, one plus one) and the output (one, two, three). By giving all these examples — this large data set — to the learner or model, it comes up with a model for adding: it learns to add. Now, it's a silly example, because computers are very good at adding algorithmically — it can be done with Boolean operations — so it would be dumb to write a machine learning program to do addition. But it's a simple toy example of how machine learning works: lots of data where both the input and the output are provided.

If you don't know the answer — if you don't have examples of the answer — then it's really hard to do machine learning. So when some of you were describing your machine learning problems, some of those things might have to be framed a little differently: you have to think about how to get a training set where you have both the query and the answer.
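To make the addition toy example concrete, here's a minimal sketch of "learning" to add from input/output examples rather than being programmed to add. It assumes Python with scikit-learn, which is what we'll use in the Colab notebooks; the data and model choice here are purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training set: the inputs are pairs of numbers, the outputs are their sums.
X = np.array([[a, b] for a in range(10) for b in range(10)])
y = X.sum(axis=1)

# The "learner" fits a model that maps inputs to outputs.
model = LinearRegression().fit(X, y)

# The learned model can now "add" a pair it has never seen before.
print(model.predict([[123, 456]]))  # approximately 579
```

Nobody would do this in practice, as noted above; the point is the pattern — the learner sees both queries and answers, and it induces the rule.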
And if you've got that, then you can use machine learning. Here's an example involving visual recognition, where traditional computer algorithms do really badly. These are examples of the number two written by different people — some of whom, I suspect, don't know how to write. If you gave these to a traditional computer program, it probably wouldn't know what to do with them. But if you gave these examples of what a two looks like, written by different people, to a machine learning model, and told it that all of these are different ways of writing the number two, it would develop the capacity to recognize the number two. It probably wouldn't be able to do anything else, but that's visual recognition, and you could train it to recognize threes and ones and more and more. This is how a lot of character recognition software actually works — the same kind of software used to identify addresses on letters for Canada Post or the US Postal Service.

To distinguish the two approaches: conventional computer programs perform tedious tasks much faster and much more accurately than humans. Traditional programming is very good at addition, subtraction, multiplication, averaging. It's also good for things like spell checking, because it can look up every single correctly spelled word, compare to find a close match, and give you the correct spelling. That's a conventional program. Machine learning algorithms, on the other hand, perform tasks that are difficult or infeasible with conventional algorithms. Spell checking is one thing; grammar checking is totally different. It's not something you can just look up in a long list and ask whether it matches. It's about style, about which words go best together, past and present tense, putting things into plural or singular form. Or taking a spoken word — an audio file with variations in frequency — and converting it into words: something we do every day, but something traditional programming couldn't do until machine learning came along. Image recognition — looking at all those different versions of the number two — is again something our eyes and brains are good at, but traditional programming fails at it, and machine learning is very good at it.
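As a flavour of how digit recognition works in practice, here's a small sketch using scikit-learn's bundled handwritten-digit images. This is a stand-in for the postal-code example above, not the actual Canada Post system; the dataset and classifier are just convenient illustrations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 grayscale images of handwritten digits, each labeled 0-9.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Train a small neural network on labeled examples of each digit.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_train, y_train)

# The model now recognizes handwritten digits it has never seen.
print("accuracy:", clf.score(X_test, y_test))
```

Notice the recipe is the same as in the addition example: labeled inputs and outputs go in, and a predictor comes out.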
So if we're distinguishing between machine learning (ML) and artificial intelligence (AI), we have to remember that AI is older — it dates from the early 50s — and many people view machine learning as a subfield of AI. Both still require lots of data: training data, examples, inputs and outputs. Some people say they're the same, some view them differently; there's no right answer. Traditionally, AI was built on expert systems, where people would write out long lists of rules — if-then-else and the like — and lookup tables. For both checkers and chess there were lookup tables: thousands of games all loaded up, so the computer could look things up and say, "I've seen this pattern before; I should do the next thing." That's how those systems solved problems, taking advantage of perfect memory and near-infinite storage — something the human brain doesn't have.

Machine learning doesn't use expert systems; it doesn't use if-then-else or large lookup tables. It uses probabilistic computing and statistics. It uses optimization techniques — gradient descent methods, maximum likelihood or expectation-maximization methods — to make predictions. It's different. Ironically, or interestingly, most of the people now working in machine learning actually started in artificial intelligence, and many do both, although I think more and more people are moving into machine learning because of the spectacular successes it's had. There's a lot of blurring in the language between machine learning and AI. Something like face recognition — whether it's the CIA or CSIS recognizing your face (the system still has to figure out where a person's eyes, nose and mouth are and their proportions), or your cell phone focusing on a face or faces — that's machine learning.

AI made its heralded debut, if you want, when Deep Blue beat Garry Kasparov in a world championship chess match in 1997. That used artificial intelligence, but as I said, it used large libraries of many, many chess games and many scenarios that it played on its own, against itself, to build a large lookup table of what to do in a chess game. That's artificial intelligence: it's like having a perfect brain with perfect recall. But it's, I'd say, different from machine learning.

Some of you might be old enough to have watched Jeopardy!, and some of you know Ken Jennings is the new host. This was when Ken Jennings was much younger — he's widely considered the best Jeopardy! player ever, one of the smartest people around. For those who don't know it, Jeopardy! is a game mostly about general knowledge, where you have to answer a statement with a question — basically a question-answering skill test. Brad Rutter is considered the second-best Jeopardy! player of all time. And the thing in the middle, called Watson, is a computer program. Watson was a development project by IBM to try to make something smarter than a human — smarter than the smartest human being. It made its debut in 2010. It took four years and more than 1,000 people, and cost IBM almost a billion dollars; after investing a billion dollars, it won a million dollars on Jeopardy! in 2011.

It was actually a very impressive tool, because it used natural language processing, which was brand new at the time — being able to understand English, or any language. It used information retrieval and lookup, which is sort of classic AI; I believe it had automated reasoning; and it also used machine learning. At the time you had resources like encyclopedias, Wikipedia, online dictionaries, thesauruses, newswire articles, DBpedia (which is derived from Wikipedia), WordNet, YAGO — all online. If someone wanted to spend three years of their life reading and totally memorizing them, they'd have what Watson had — Watson, of course, memorized all of it. Watson has since moved from what used to be very large computers onto the cloud, and it has capabilities to see, hear, read, talk, taste, interpret, learn and recommend. So it was sort of the pinnacle — the K2, if you want — of machine learning and of what we thought we could do with AI.
So then, last year, I guess, the Everest appeared: ChatGPT. This is a chatbot — chatbots have been around for a long, long time, but very simple-minded ones. OpenAI released ChatGPT (version 3.5) on November 30, 2022. It cost almost as much as Watson, about $700 million; it didn't take as many people — about 350 — but they still spent four years working on it. It used more sophisticated machine learning technology than Watson had. GPT stands for generative pre-trained transformer; transformer models are similar to graphical neural nets, which are more advanced versions of artificial neural nets. It's a member of the class called large language models, or LLMs. And it used huge amounts of text — 45 terabytes, which translates to about 300 million typed pages, or 500 billion words.

For a comparison: most of you have done texting, and most of you have autofill. You type the first two letters of a word and it guesses the rest of the word, or the next word. That's kind of what ChatGPT does — except instead of guessing the next word, it guesses the next 10 or 20 or 30 words based on what the previous words were. That's roughly how our brains work when we speak. The amazing thing about ChatGPT is that it seems to be smarter than Ken Jennings, smarter than Watson: it has passed very difficult exams, some almost perfectly — the SATs, Graduate Record Examinations, legal exams, many others. This is where machine learning has taken us, and it might be one of the reasons some of you signed up, because what ChatGPT and machine learning can do right now is quite astonishing.

Some of you have used ChatGPT — let's take a poll. Just fill in Slack or the Zoom chat box: how many of you have actually used ChatGPT? Just say yes, and we'll find out whether it's 50% or 75% or 100% of you. ChatGPT can be funny — it's made jokes. It can be used to help write code, and we'll see some examples. People have written some pretty impressive cover letters with it and gotten jobs; some people have even taken on extra jobs, pretending to do the work themselves while ChatGPT does all of it. It can write poems and song lyrics, and a lot of them are impressively good.

There's another distinction worth making, and some of you mentioned this as well: machine learning versus data mining. Data mining is trying to discover previously unknown knowledge from a large corpus of data or text. Machine learning is focused on reproducing or predicting from known knowledge. Both require lots of data, and both can be used to predict, but data mining essentially allows you to make new observations, new inferences. Now, ChatGPT appears to be capable of that — so at some level ChatGPT can perform elements of data mining, and in fact you can adapt it; we'll show you how to use ChatGPT to do data mining or information extraction.
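To make the autofill analogy concrete, here's a toy sketch of next-word prediction — my own illustration, not how GPT is actually implemented. It just counts which word follows which in a tiny corpus; an LLM does something conceptually similar, but with billions of learned parameters over vast amounts of text instead of a simple count table.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1][w2] += 1

# Predict the most likely next word given the current word.
def next_word(word):
    return following[word].most_common(1)[0][0]

print(next_word("the"))  # 'cat' - the most frequent follower of 'the'
```

Chain that prediction and you get autocomplete; scale the idea up enormously and you get something like a language model.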
So what we've heard a lot about over the last few years is deep learning. It's something ChatGPT does, and something most of the advanced machine learning tools employ. Deep learning is a subdiscipline of machine learning that uses variations of artificial neural nets, or ANNs, with a much greater number of layers. We call these hidden layers, and instead of having one or two hidden layers you have five, six, seven or more — so the ANNs become deep neural nets.

[We had a hand up in the chat — I think it was left over from the ChatGPT poll. The count was 18 yeses plus one hand up, so 19 in total.] So roughly half of you have used ChatGPT, which is good — I won't be completely reviewing something everyone's already been doing, and tomorrow we'll talk a little more about what ChatGPT can do for you.

Anyway, the point about deep neural nets is that because they're more complicated, they take much longer to train and often require very specialized computers with graphics processing units (GPUs) or multi-CPU clusters. In exchange, they can learn more complicated patterns, handle tougher problems, and make smarter predictions — they perform better. This has been critical to the large language models, the LLMs. Within deep neural nets there are things called recurrent neural nets, convolutional neural nets, graphical neural nets, deep belief nets — all examples of architectures for deep learning. Many of them are inspired by what we've learned about the human brain: how we learn, how we have short-term and long-term memory, how we forget (forgetting can be important for learning), and how we reinforce what we learn — if we're trying to say a new word, we might say it three or four times to practice. A lot of those ideas are implemented in these more advanced deep neural nets. So this is where biology has actually inspired a lot of what is used in machine learning.

The pioneers who started the field of deep learning are two Canadians: Geoffrey Hinton at the University of Toronto and Yoshua Bengio at the University of Montreal. Hinton is originally from the UK but moved to Toronto, I think in the late 80s. He won the Turing Award, which is like a Nobel Prize for computing, and he's a Fellow of the Royal Society. He was working for Google but resigned out of his concern over ChatGPT; he's older and, I think, semi-retired now. Yoshua is much younger; he started a large institute in Montreal, and it's largely because of him that a lot of the AI activity in Canada — in fact, in the world — is now based in Montreal.

There are three approaches to machine learning. The first is supervised learning — the most common one; probably 95 or 99% of machine learning is in this area. Then there's unsupervised learning, a small percentage, and reinforcement learning, which was also used to help refine ChatGPT and which helps with certain select problems. In supervised learning, you give examples of inputs and outputs — desired, labeled outputs. That's the addition example: the idea is to learn the rules (in that case, the rules for addition) that map the inputs to the outputs. It's how you find biomarkers. Almost everything you've described wanting to do — classification, interpretation, pattern analysis — will require supervised learning. Unsupervised learning, by contrast, is saying: I have no idea what the answer is.
You have unlabeled data, and the model tries to figure out rules to find structure in the input data. It's actually, to some extent, how humans learn; it requires elements of creativity. It can also be thought of as a clustering approach. Imagine a whole bunch of socks that have been through the washer and dryer, and now you're trying to pair them up. Most of us match them by shape or by colour: you're not going to put a white sock with a blue sock, or a large sock with a short sock — we intuitively know how to do this. A computer might not know what to do with socks, but if you give it some rules about how to cluster, it might figure out some of those elements (there's a small sketch of this below).

Reinforcement learning essentially gives continuous feedback to maximize rewards. In that sense it resembles the gradient descent optimization used in many learning methods, but it's also how we learn when someone says "nice job," or how a dog learns when it fetches and you pat it on the head. I'd call it a similar form to supervised learning, but to some extent it optimizes — and can optimize faster — for very difficult problems.

Supervised learning is used for classification, grouping, regression (which is curve fitting), spam filters, ranking and recommender systems, face verification, voice recognition, biomarker identification, gene signal analysis — almost all the things you described wanting to use machine learning for. Unsupervised learning is for things like target recognition, sock pairing, or seismic data analysis, where you have a lot of noisy data and you're trying to figure out where something significant might be. Reinforcement learning has found most of its use in man-machine environment applications; autonomous car driving is one example, and ChatGPT — especially its refinement.

I mentioned a program, or a model, or a learner that's used in machine learning — different people call it different things. I'd call it a program, because as you'll see, these are computer programs that you still write, but machine learning people prefer to call them learners or models. Those models fall into different classes. We're going to look at two in particular: decision trees, which form the basis of random forests, and artificial neural networks. But there are others: support vector machines, genetic algorithms, convolutional neural nets, recursive neural nets, graphical neural networks and so on. Some are very easy to code, like decision trees; genetic algorithms are easy to code as well. Others, like the neural nets — convolutional or not — are quite a bit more difficult.

So what are the applications? A lot of us have heard about self-driving cars — although despite all the activity, there isn't really a fully autonomous car on the road yet. Game playing, face recognition, fingerprint recognition, traffic sign recognition, automated stock trading, email filtering, gesture recognition for PlayStation-type games, speech recognition, handwriting, radar analysis, medical diagnosis, bioinformatics and so on — the list goes on and on.
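Going back to the sock-pairing example, here's a minimal sketch of unsupervised clustering. The "socks" and their features are made up for illustration; the point is that no labels are given, and k-means finds the groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each sock is described by two unlabeled features:
# colour (0 = white, 1 = black) and length in cm. No labels anywhere.
socks = np.array([
    [0.0, 20], [0.1, 21], [1.0, 20], [0.9, 21],   # short white, short black
    [0.0, 40], [0.1, 41], [1.0, 40], [0.9, 41],   # long white, long black
])

# Put colour and length on comparable scales so neither feature dominates.
X = StandardScaler().fit_transform(socks)

# Ask k-means to sort the pile into four clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # socks sharing a label belong to the same pile
```

No one told the algorithm what "white" or "long" means — it just groups similar items, which is exactly the sock-matching intuition.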
Some of you may have things like Siri or Alexa or a Google device that helps control your office, home or apartment, where you can talk to the device. If you say, "Alexa, turn on the light," it takes the sound — an analog audio signal — and converts it to digital. It goes through various pattern recognition steps; typically a deep neural net classifies the spoken words, identifies your voice, breaks down and parses out the words, and creates what we call segmented speech. From that, it comes back and asks what it can do, or turns the lights on or off, or plays music. Maybe about 20% of the population has one of these devices now, but it's an everyday example of speech recognition. Your phones — and probably 90% of you use this — have similar kinds of voice recognition, and it's deep learning that does it.

If any of you have ever been called up and told your credit card has been compromised, that was usually done through machine learning. It's called anomaly detection. It takes data from your credit card purchases over the last number of years and tracks not only where you purchase things but how frequently, whether you tend to buy on weekends or evenings, which stores you go to. It extracts patterns that characterize the typical way you buy things, and it also includes information like your age, your sex and your job. Then it looks for things that would be highly unusual for you. If you've always been living in Winnipeg and then suddenly there are purchases in Saudi Arabia one day, Manila the next, and Cape Town the day after — and you're clearly not someone who jet-sets around the world — that suggests a problem: that you are not who you say you are, and someone is buying things under your name. That's how fraud prevention is done. It's highly personalized — everyone has a profile — and that's how anomalies are detected.

One of the biggest news events in the machine learning world happened about 15 years ago: the Netflix challenge. I'm sure a number of you have used or subscribed to Netflix, which suggests TV shows based on your preferences and what you've been watching. Netflix had developed its own algorithm — a traditional programming algorithm — but they opened a contest, saying: let's see if you can come up with a better method for predicting what people will prefer based on their viewing habits. Some people like action-adventure movies, some like romantic comedies, some like foreign films, some like black and white; you can infer and classify and build a personal profile, not unlike the credit card one. Even as far back as 2009, the winner — using data from hundreds of thousands of users and thousands of movies — had a machine learning algorithm that was 10% better than the Netflix algorithm, which Netflix had spent millions programming with traditional methods. That's a really significant boost, and Netflix continues to use that approach; Amazon does as well, to identify what you might like to purchase. Again, that's machine learning.
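Here's a minimal sketch of the anomaly-detection idea behind the fraud alerts mentioned a moment ago. The purchase features are invented for illustration — this is not any real bank's system — but the pattern is the same: learn a customer's normal profile, then flag what doesn't fit it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one purchase: [amount in dollars, distance from home in km].
# This customer's history is mostly small, local purchases.
rng = np.random.default_rng(0)
history = np.column_stack([rng.normal(60, 20, 200),   # typical amounts
                           rng.normal(5, 3, 200)])    # typical distances

# Fit the detector on the customer's usual behaviour (no fraud labels).
detector = IsolationForest(random_state=0).fit(history)

# Score new purchases: +1 means looks normal, -1 means flag for review.
new = np.array([[55, 4],         # groceries near home
                [3000, 9500]])   # huge purchase on another continent
print(detector.predict(new))     # expect [ 1 -1 ]
```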
If you have a cell phone and take pictures with it, you'll have fairly advanced face recognition. This lets the phone focus on faces rather than ending up focused on the background or on just one individual. In the picture on the slide, one person might have been in focus and everyone else out of focus, but because the phone recognized the faces, everyone's face was brought into focus — the aperture and timing were adjusted automatically. This is part of the autofocus, auto-aperture and auto-shutter adjustment used in most cameras.

Autonomous vehicles are another example of where machine learning is used, particularly reinforcement learning. They've been testing them for more than a decade, and driving is a really tough challenge: you have to recognize the road, changing road conditions, vehicles coming at you, road signs. You have to do path planning. In some cases you learn from how a good driver drove; in some cases you have to simulate conditions with bad drivers all around you. Reinforcement learning seems to be one of the better methods, but the system also has to use vision — pattern recognition — so supervised learning is needed too. It is still, I would say, an unsolved problem, still something humans do better than computers — but probably not for long.

So I've given you examples of machine learning in everyday life: Siri, Alexa, Google, image recognition, Netflix, fraud prevention. Obviously this course is about bioinformatics, and machine learning applications in bioinformatics are almost as old as machine learning itself: secondary structure prediction (predicting from a sequence where the alpha helices and beta strands are), finding genes, finding motifs, GWAS analysis and SNP typing, disease classification and diagnosis, biomarker identification, DNA sequencing. You can apply it to predict spectra — NMR and mass spec. It's used in drug design and drug discovery, and in protein three-dimensional structure prediction.

I've been doing machine learning for a long time — I started publishing in it about 20 years ago, though I was really interested in it back in the late 80s. We looked at predicting breast cancer susceptibility using SNPs. We wrote a review on applications of machine learning in cancer prediction — apparently one of the first articles in that area; it's been pretty widely cited. We've used it in protein secondary structure prediction, about 15 years ago, and more recently in tools for predicting mass spectra and NMR spectra, which is very helpful in metabolomics but can also be used in proteomics and many other areas. So I come at this more as a user, though I've certainly worked on machine learning in genomics, proteomics and metabolomics and applied it in a wide variety of areas.

Here's another application, where we used machine learning for genome-wide association analysis. We looked at SNP panels from GWAS studies to determine optimal collections of SNPs for predicting certain diseases, and we used a support vector machine (SVM) and random forest regression to calculate these receiver operating characteristic (ROC) curves. It's another example of using machine learning to help with genomics and biomarker identification.
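As a rough sketch of that SVM-plus-ROC idea — with simulated SNP data, not our actual pipeline — the workflow looks something like this:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy stand-in for a GWAS panel: rows are people, columns are SNPs coded
# 0/1/2 (copies of the risk allele); y is disease status (simulated).
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(400, 50)).astype(float)
y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, 400) > 5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An SVM with probability outputs, so we can score an ROC curve from it.
svm = SVC(probability=True, random_state=0).fit(X_tr, y_tr)
scores = svm.predict_proba(X_te)[:, 1]

# Area under the ROC curve: 0.5 = guessing, 1.0 = perfect discrimination.
print("ROC AUC:", round(roc_auc_score(y_te, scores), 3))
```

Here only the first five simulated SNPs actually carry signal, which mimics the real task: finding the small informative panel inside a much larger set.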
Now there are other applications — I'm not just going to talk about things I've done. Here's an interesting one using SNP variants: they looked at literally hundreds of them across different phenotypes, focusing on SNPs for heat sensitivity. Some people can hold a boiling hot cup of coffee for ages, and others basically have to use oven mitts — some people are highly sensitive to heat and some are not — and we know some of the genes associated with this. They did TRPV1 and TRPA1 genotyping and next-generation sequencing to look at all these variations, and using unsupervised learning — something called swarm clustering — they were able to identify and pull out 31 gene sites, reducing a large number of candidates to a small number. They could then start categorizing people as heat-sensitive or not, and identify which genetic components — which SNPs — allowed that categorization.

Some of you may have worked with the MinION, the little tiny DNA sequencer that uses nanopores — the Oxford Nanopore DNA sequencer. They've been working on it for probably 15 years, and it does long-read sequencing; I've been watching it and working with people on the MinION for many years. They used crowd-sourcing to help solve the base-calling problem. The nanopore is embedded in an electrically sensitive membrane, and when DNA passes through the pore you get a change in the electrical readout — it goes up and down depending on the sequence. But it's not obvious: it's not always down for an A and up for a C. The signal can be intermediate, depending on the local sequence and the number of repeats — ACGT, ACGT might produce a very different signal than ACGTGT. So they had to use hidden Markov models, recurrent neural nets and a variety of other machine learning methods, and many, many people tried and adapted them — crowd-sourced machine learning, essentially — and eventually they solved it. You can take the readouts shown at the lower left of the slide, and those signal intensities can now be read out as letters, as DNA sequence. It's quite accurate, quite fast, and incredibly cheap. Machine learning converted something that seemed impossible at the time into something now widely used for long-read sequencing.

Then there's DeepBind, a deep neural net for predicting the sequence specificity of DNA- and RNA-binding proteins: how do you predict, based on sequence, where proteins will bind on a strand of DNA or RNA? It takes whole lists of known motifs, performs scans, looks at features using neural nets, and updates its parameters — very much a neural net approach — and it does quite well at predicting which proteins will bind where, and potentially why. There's also a deep learning tool — a convolutional neural net — that can analyze RNA-seq data directly: you don't have to do alignment or a lot of sequence pre-processing; it can work directly with FASTQ files. It assesses the quality of the reads, and people have adapted it for single-cell sequencing and ChIP-seq analysis.
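Models like DeepBind start by turning sequence into numbers, and a common first step is one-hot encoding, where each base becomes a four-element vector. This little sketch is my own illustration of that encoding step, not DeepBind's actual code:

```python
import numpy as np

# Map each DNA base to a column of a 4-wide one-hot matrix.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (length x 4) matrix of 0s and 1s."""
    mat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        mat[i, BASES[base]] = 1.0
    return mat

print(one_hot("ACGT"))
# [[1,0,0,0],
#  [0,1,0,0],
#  [0,0,1,0],
#  [0,0,0,1]]  - a numeric form a neural net can consume
```

The same trick works for protein sequences with a 20-letter alphabet, which is partly why sequence data is so amenable to these models.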
So again, these are applications of machine learning working from sequence data, or even just FASTQ files. People have also been using machine learning to optimize the design of CRISPR target sequences — these have to be optimized to some extent if you want efficient gene insertion or gene replacement — and it's the same sort of thing if you want to optimize other types of sequences for recognition. So again, machine learning is proving very useful here.

In many of these cases we're looking at sequence data, and that's where we'll focus a fair bit for the examples we'll be choosing. Sequence data is like language: just as ChatGPT learns to recognize words, which are basically sequences of letters, DNA sequence data is very amenable to machine learning applications, as is protein sequence data.

You can also use machine learning in cancer applications — for example, decision support tools for cancer screening. These take genealogical data, mammogram imaging data, genetic test data — very diverse types of data — and feed it all in, including electronic health record data. The system integrates it and comes up with a risk assessment of someone's likelihood of developing cancer. You can do the same thing by combining metabolomic and proteomic data to produce better risk scores or predictions. Rather than relying on a physician alone, they were able to show that using machine learning tools not only improved the cancer risk assessment but was also much faster.

Some of you may have used 23andMe. They're always changing, but essentially they take your genome and analyze it with a SNP test — a few hundred thousand SNP assessments. Because many, many people have taken the test, and because people sign off and say "use the data however you want," they were able to take GWAS data from 600,000 people and integrate the information people provided about their weight and height (and therefore their BMI) and their lifestyle, to come up with a sort of genetic weight predictor. In this example it predicts that the person will have a tendency to be slightly overweight: your genes predispose you to a weight about 3% above average. This is probably only offered to people in the US rather than to 23andMe customers in Canada, and they're running into a few challenges, because in many cases GWAS data isn't that accurate in the sense of predicting a physiological outcome.

Then there's tumour genomics — looking at single nucleotide variants in tumour samples. This one uses a fairly simple technique, random forest, which is a collection of decision trees, and it was used to classify tumours; I think some of you talked about this. You have to have a collection of appropriate data — they made sure to include normals. They selected certain types of features, worrying about things like strand bias, variant allele fractions and batch effects, and by including those features in their model, they were able to get almost perfect sensitivity and specificity — much better than what humans can do.
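Here's a rough sketch of that kind of variant classification with a random forest. The feature values are simulated — not the paper's data — but they mimic the sorts of features just mentioned, and they show why people like random forests: you can inspect which features the model relied on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy variant table: [variant allele fraction, read depth, strand bias].
# Label 1 = real somatic variant, 0 = sequencing artifact (all simulated).
rng = np.random.default_rng(2)
real = np.column_stack([rng.uniform(0.2, 0.6, 300),
                        rng.normal(200, 40, 300),
                        rng.uniform(0.4, 0.6, 300)])
artifact = np.column_stack([rng.uniform(0.01, 0.1, 300),
                            rng.normal(60, 30, 300),
                            rng.uniform(0.8, 1.0, 300)])
X = np.vstack([real, artifact])
y = np.array([1] * 300 + [0] * 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("accuracy:", rf.score(X_te, y_te))
# Unlike a neural net, the forest is not a black box:
print("feature importances:", rf.feature_importances_.round(2))
```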
Newborn screening is something that any of you under the age of 30 have had. Within two to three hours of birth, they take a blood sample, run it through a mass spec and determine whether you have a metabolic disorder — something none of you remember, and something most of your parents probably don't even know about. In fact, newborn screening is the most widely used test in the world: 300 million people have had it, and more than a million people have benefited from it. But people do make mistakes, and by using machine learning to help interpret the mass spectral data collected from those blood spots, screening labs were able to reduce the number of false positives — from 21 to 2 for phenylketonuria, from 30 to 10 for hypermethioninemia, and from 209 to 46 for another metabolic disorder, a carnitine deficiency.

One of the biggest breakthroughs in machine learning — and I hope some of you have heard about this — is AlphaFold, or AlphaFold 2. Maybe we can take another poll: in the chat, just indicate whether you've ever heard of AlphaFold 2, and the TAs can count how many voted yes. It was considered the breakthrough of the year in 2021. That may have been when most people were hibernating in their basements because of COVID, so it didn't make the headlines I think people had hoped for, but AlphaFold 2 has solved one of the biggest challenges in biology: the protein folding problem. It's put a lot of structural biologists out of a job, and it's changed how we think about structural biology.

AlphaFold 2 used huge amounts of data: the already-solved protein structures in the Protein Data Bank, plus a lot of sequence data, because we've sequenced literally millions of proteins through DNA sequencing. It performed multiple sequence alignments, and from those literally millions of alignments it did something called embedding — a fairly time-consuming process that encoded both the alignment and the sequence information intelligently. It also used information about pairwise distances, distance matrices, and it did co-evolution analysis, shown at the bottom of the slide. Co-evolution is something people picked up on about 10 years ago: in the multiple sequence alignments of some proteins there are correlated mutations — if one residue changes, the other changes too — which suggests a constraint: those residues have to be in close contact with each other. All of this information was put into a fairly sophisticated deep neural net to create AlphaFold 1 and then AlphaFold 2.

Here's AlphaFold 2's performance against the two other top-performing programs of the time: one is Rosetta, which made news about 15 years ago, and the other is from the Zhang group — I'm not sure whether they're at Emory or elsewhere. The test was an open reading frame from the COVID virus whose structure had just been determined. They asked the Zhang team, and the Rosetta group at the University of Washington, to run it through. The Zhang model came out as essentially random coil, with nothing to do with the actual structure; the Rosetta model was mostly beta sheet, but again with really no overall similarity. The AlphaFold 2 structure was essentially bang on. And this was a protein with essentially no sequence homologs, so you couldn't do homology modelling at all.
So this was quite striking, and because it was published during the COVID pandemic, when people were trying to find target proteins and ways to find drugs, it was, I guess, the nail in the coffin for these other programs — but also a real triumph for what machine learning can do for protein structure and drug discovery.

I'll stop right here and ask whether people have any questions or comments. Maybe I can also get feedback from the poll: how many people had actually heard of AlphaFold before? [There were 21 responses: 20 yeses and one no.] Okay — so either 95% of you have heard of it or, counting everyone who didn't respond, maybe 50%. Does anyone have any questions about what I've covered so far — what I'd call the gentle introduction to machine learning? ... Someone has their hand up; go ahead.

"Can everyone hear me? I was wondering — you mentioned that data mining is different from traditional machine learning. But isn't unsupervised learning a kind of data mining as well?"

It could be seen that way, although traditionally it hasn't been applied that way. Data mining is typically done with text, or with large tables of data, while unsupervised learning is done more with, I'll say, numeric data, looking for general patterns — seismic data analysis and peak finding would be examples of unsupervised learning, and clustering is a pseudo-supervised learning method. But as I say, most people don't do a lot of unsupervised machine learning. Data mining is essentially extraction: pulling information out, then looking at patterns, then perhaps drawing inferences from those patterns. And it can be done algorithmically, as opposed to probabilistically. But yes, I suppose someone could apply unsupervised learning to data mining if they wanted. — "Okay, thank you."

Nazia has a comment or question. [Reading from the chat:] She's wondering whether we'll learn more about how GWAS summary stats can be used with ML. We won't cover that in this course, but the GWAS ROC-curve paper I highlighted actually talks about how you can take GWAS summary stats and extract or calculate information from them. Most GWAS data is only available in summary form — you need very specialized permission to get the really detailed data — but summary data is widely available in databases, and there are ways of extracting useful information from it with machine learning; that paper explains how. In many cases, what we're trying to do here is set you up so you can read other papers that do something similar to what you want to do, but read them with a different eye afterwards — not be so intimidated by what they're describing, and be a little more aware of the language they're using.
In two days we're not going to be able to solve every problem, or show you how to solve every problem, but we can give you more grounding so that you can intelligently read papers, or download software and know enough to actually install it and use it in your project. And you're free to talk to Vasu, Sagan and Mark, because they've solved lots of machine learning problems for lots of people — that's why we have the TAs here.

We have a question from the chat, and then two hands raised. From Sonetra: what are the applications of genetic algorithms? Genetic algorithms are most often used for optimization — searching through a space to optimize — so a genetic algorithm can be a tool that forms part of the optimization process used in learning. I think some people have also used genetic algorithms as a learning process in themselves. Genetic algorithms are very simple to program: just like genes, the candidate solutions cross over, hybridize, mutate and frameshift. That's a way of moving your data, or configurations, around a search space, and it turns out to be a fairly efficient way of optimizing things — it's how evolution happened. Genes have crossed over and mutated to get from single-celled microbes to giraffes and elephants, adapting to different conditions. So that's how optimization, which is part of machine learning, can be done.

[Next question — please go ahead.] "I was wondering about the effect of the quality of the training set on machine learning. I've been using AlphaFold 2 a lot, and I know a bit about the convolutional networks behind it, but I was wondering what they did so much better: is the algorithm better, or is the training set better?"

There were lots of things, more or less simultaneously, that made AlphaFold 2 better. A couple of articles appeared — I think in Scientific American last year — that talk about the history of what they did and how they rewrote AlphaFold into AlphaFold 2. A lot of it was the machine learning model they chose: it was more advanced and more sophisticated, and I think they attribute a lot to that. It was also making better use of the data — a lot of people had used co-evolution and nothing else, and the distance matrices were critical; people hadn't really used those before. I think they also have a really good energy optimization function, which gives high-quality coordinates and produces better structures. And I think they had reached a critical threshold in the number of sequences and the number of structures — things start happening when you get beyond a certain threshold. It's what happened with the large language models: ChatGPT was kind of useless two years ago, but as the model grew, it suddenly passed a threshold in the number of data points it contained. You've probably experienced the same thing yourself: you struggle with a problem, or a gymnastics trick — you practice and practice, and suddenly, at some point, it just happens.
That's the point some of these models reached: they'd done it enough, or had enough data, or had tried enough times, that they figured it out. So I think it was a combination of many things for AlphaFold 2; there's no single factor they can point to as the real reason it got so much better. — "Thank you."

"Hi, thanks for the introduction. I have two questions about ML and the course. Looking at the material you've shared, I haven't seen any normalization methods for pre-cleaning the data set before it goes into training — do you have any resources for getting trained on that? And second, I've heard there's a kind of black box in machine learning and deep learning, meaning the model sometimes doesn't find the right features — it can pick up a batch effect or something like that, choose that feature, and give the wrong result. Could you explain a bit more about the black box? Thank you."

In terms of data cleaning, I'll talk about that coming up here. There isn't any single algorithm or magic solution to data cleaning: you have to understand your data — if you have no idea what your data is about, you won't know how to clean it. But there are some pretty standard steps, like imputing missing values and normalizing values so they fall within a certain range; I'll show a short sketch of those steps shortly. In terms of the black box concept: some machine learning algorithms are not black boxes — decision trees, and even random forests, are not, which is one reason some people prefer them: you can look at the code and the results and rationalize what's going on, what's been chosen and why. Neural nets are black boxes, although there are some methods that let you figure out a little about what they're trying to do. You can also do something called feature selection, which helps make sure your model isn't choosing random, useless data to make its inferences, or becoming overtrained. We'll talk about that shortly.

[Not sure if we have time for another question — there's one in the chat, from Shabat:] "Could you discuss the recommended data size — the number of data points — for working with ML models to get accurate predictions? It's sometimes very challenging to get a big data set, due to factors including the cost of genetic material extraction and the presence of inhibitors, especially when working with environmental samples."

I'll talk about this a bit more later, but it's sometimes hard to know exactly what the size should be. There are tricks. As a general rule of thumb, you usually need thousands of examples. A few dozen almost never works; a few hundred can sometimes work. There are ways of working with small data sets — essentially, coming up with intelligent or ingenious ways of supplementing or amplifying the data. In the case of AlphaFold, there were only about 100,000 protein structures, but we knew enough about protein structure to say that a very similar sequence will also have a very similar structure, so effectively they used the equivalent of homology modelling to create literally millions of structures on which AlphaFold could train. There's a similar technique for chemical structures: you can write a chemical structure as a string called a SMILES string, and there are many variant SMILES strings that describe exactly the same structure. So you can boost the number of SMILES strings — example structures — by using variant, non-canonical SMILES: even if you only had a subset of 100 structures, you could create a million SMILES strings, which gives your model a lot more data to learn the same phenomenon, as in the sketch below.
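Here's a minimal sketch of that SMILES-based data amplification. It assumes the RDKit package is installed (it isn't part of a default Python setup), and the molecule is just an arbitrary example.

```python
from rdkit import Chem

# One molecule (aspirin), written as a canonical SMILES string.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Generate randomized-but-equivalent SMILES strings for the same molecule.
# Every string describes exactly the same structure, so each one is a
# "free" extra training example for a sequence-based model.
variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(1000)}

print(len(variants), "distinct strings, all encoding one structure")
print(sorted(variants)[0])
```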
It has to be data-specific. And you have to think about it in a few ways: the skill in machine learning isn't necessarily which algorithm or model you choose — it's largely about defining your problem, constructing your data set, and transforming your data set. That's where the real winners are. The best machine learners are the ones who do those things well; choosing the model is pretty rote these days.

Okay, I think we'd better get underway, because we only have about 20 minutes left. So: this is the machine learning workflow. I'll show it over and over again, and it's the one thing I'd like you to memorize — if there's anything you take from this course, it's this. There are six steps.

The first step is defining your problem and having a target solution. A lot of people make mistakes by not defining their problem well. "I want to solve the world's problems" — that's not well defined. "I want to solve the protein folding problem" — that's reasonably well defined. "Given a protein sequence, I want to determine the three-dimensional structure to within two angstroms RMSD" — that's better defined, because it has a measurable criterion. Then you propose a solution, which might be: I want to use the Protein Data Bank as a training set — or the Protein Data Bank, plus a whole bunch of homology models, plus co-evolution, plus multiple sequence alignments — to help with this process. So think about your problem long and hard. Identify whether you have enough data, whether the problem is well defined enough, and whether it's something for which you have both the question and the answer. In the case of the protein folding problem, we had answers — 100,000 structures, a large database — and we knew there were millions of sequences that hadn't been solved. So it was a well-chosen, prominent problem, and it had enough training data — or, if you thought about it in smart ways, could generate enough training data — to make it solvable.

Once you've chosen your problem, you have to construct your data set. Again, for AlphaFold 2 they had data, and it was clean, so they didn't have to do a lot of data cleaning — but it took a long time to build those data sets and to run all the multiple sequence alignments. The multiple sequence alignments weren't done with machine learning; they were done with multiple sequence alignment programs. So part of the analysis was done by conventional programs, not machine learning — but those produced the data that was then fed into the machine learning.
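The third step is transforming your data, and two standard transformations come up constantly: imputing missing values and normalizing ranges. Here's a minimal sketch of both with scikit-learn; the little table is invented, and what's appropriate for your own data will depend on what the data actually is.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# A small measurement table with one missing value (np.nan).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],   # missing measurement
              [3.0, 220.0]])

# Step 1: fill in missing values, here with the column mean.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Step 2: rescale each column to mean 0, standard deviation 1, so
# features measured in big units don't dominate the learning.
X = StandardScaler().fit_transform(X)
print(X)
```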
That transformation step was also important for AlphaFold 2: they had to normalize distances, which is why they used a distance matrix; I think that helped a lot. To normalize their data they also had to do some feature selection, although in some cases the models themselves are able to do feature selection. And then they chose their model, which was ultimately a transformer network. Many people, actually the best machine learning groups, will run about 10 or 12 different models and see which one performs best, because you can't necessarily know in advance which will perform best. Each of them still needs little tweaks: for some models the data has to be normalized or transformed, for some it doesn't, and you have to know which ones need to be manipulated and which don't. Then you run the models and assess performance. Once you've trained your model on some training data, you have to test that model. This is where a lot of people make mistakes: they basically stop at the training set, publish, and say, "I've got it all solved." Then someone else takes their model, runs it on their own data set, and the model fails completely. This is why you have to do validation, using things like leave-one-out or k-fold cross-validation. Once you've done the appropriate testing and validation, then you can say, okay, my model is finished; now it's ready to do things. In the case of AlphaFold, they said, yes, we've tested it, we've validated it, we've trained it as much as we could. They were happy with its performance, so they released it to the public, and they've released all kinds of models. The program can be downloaded and run on GPU computers, although most people make use of AlphaFold just by downloading the structures. How many of you have actually run AlphaFold on a GPU computer? You can put up your hand or put it in the chat, because I think a lot of people who claim to use AlphaFold are just downloading structures that were already generated by AlphaFold. That's still fine, but running the program is one thing; using the results from the program is a different one. So when you're choosing problems, obviously you want to work on a problem that hasn't been solved, or that is interesting to you in your research. Sometimes people propose problems that are very easily solved mathematically or statistically, and some of you, at least from what I heard, were describing problems that would probably be better solved by an algorithm or a technique in a traditional program rather than machine learning. Sometimes people try to do things that are just too tough, or that require a huge amount of training data or knowledge. You want to look at things like finding patterns, classifying things, classifying groups, identifying certain features. The other thing I noticed from some of the descriptions you gave is that there wasn't a lot of training data to work with, or they were examples where the answer just isn't known. "I want to write a machine learning program to find dark matter": but we haven't found dark matter, we don't know what it looks like, we don't know the answer. It's the same sort of thing as "write a machine learning program to write my thesis"; that's pretty open ended.
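Stepping back to the "run 10 or 12 models" advice from a moment ago, here's what that looks like in code: a minimal sketch using scikit-learn's bundled breast cancer data as a stand-in, with pipelines handling the per-model scaling just discussed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Some models need scaled inputs (SVMs, logistic regression),
# some don't (trees, forests) -- a pipeline handles it per model.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    # 5-fold cross-validation; compare mean accuracy across models.
    cv_scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {cv_scores.mean():.3f}")
```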
Coming back to the thesis example: there are lots of example theses, and you could probably use ChatGPT to write a pretty arbitrary thesis, but it would be mostly hallucinating. So you need to have training data, and AlphaFold is a good example: they had lots of training data. Or you need to be smart and come up with ways of amplifying or boosting the training data you already have, not so that it's purely made up, but in a way that's reasonable or analogous. Then, constructing your data set: you have to get your data from a reliable source. I've seen lots of very poor quality data. This is where, again, most people make mistakes; most people just download the data set they're given. It's a situation of garbage in equals garbage out. The data has to be labeled; that's central to what you're doing in machine learning. The examples have to have answers attached to them. Remember, there's an input and an output. The labels could be categorical, like male/female or healthy/sick, nominal, or numerical values associated with things. You also need relevant parameters to describe the phenomenon. If you're talking about protein folding, the phase of the moon or the astrological sign doesn't have any relevance to protein folding, nor would it help with predicting DNA-binding motifs. But obviously the sequence does, something about the secondary structure could, and something about the organism, or the temperature at which the organism lives, may have some relevance. Those could all have some effect; they might contribute in terms of DNA binding or protein folding. So use your knowledge, use your intuition about what sort of data to include, because if you miss the central features and try to predict without those key pieces of information, your machine learning model won't do very well at all. Trying to predict protein structure without the protein sequence won't work. You need training data, you need testing data, and you need validation data: three types of data that you have to create when you start off in machine learning. As for how much, there's no right answer. People can get away with perhaps as few as 1,000 examples. I've seen situations where people have had as few as 100 examples, but then they used smart data amplification methods to get tens of thousands. Most of the average machine learning problems that people work with have, say, 10,000 to 100,000 examples. We'll be doing one that uses maybe 700, and you'll see it doesn't do great; I think if it had more, it would have done better. For really tough problems, this is where people talk about how many millions or billions of parameters are in large language models, it can be millions or billions of examples. ChatGPT and others have between 10 and 100 billion parameters and used hundreds of millions or billions of words and texts as examples. That's deep learning. Not every problem you're thinking of has these requisite numbers of examples, and that's something that might constrain what you want to do. But there are also smart ways of taking relatively modest amounts of data and amplifying it to get to the stage where you can do machine learning. So after you've constructed your data set, you have to transform your data and select your features.
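First, though, here is a minimal sketch of carving out the three data sets just described (training, validation, testing) with scikit-learn; the data and split fractions here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labeled data set: 1000 examples, 20 features each.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Hold out one third for final testing; the model must never see it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

print(len(X_train), "train /", len(X_val), "validation /",
      len(X_test), "test")
```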
Transforming means cleaning up the data: remove repeats, fill in or impute missing values, reformat some fields, look for outliers, group classes. These are all things that people often don't do but need to do, and it means you have to know something about your data. We call it data cleaning or data cleansing. For categorical or named or nominal data, you can use things like one-hot encoding, which is very helpful for machine learning. You can normalize skewed data, that is, make it more Gaussian, and you can range scale. So there's data transformation, and then something called feature engineering: you can add additional features, including obvious relationships that you know about from the field, and you can select some features, which is called feature selection, keeping relevant data but removing irrelevant data. This is where human intuition helps, and a lot of the best learning systems used human intuition to do feature selection. Encoding is an important thing, and we'll see it a few times later today. Take the way we think: here are four balls, one red, one blue, one green, and another blue. We understand color; our brains understand it; machines don't. So you can convert colors to a digital signature with a three-bit binary encoding: red is 1 0 0, blue is 0 1 0, and green is 0 0 1. Now you've done one-hot encoding. You can do this for colors, and you can do this for letters in a sequence. One-hot encoding is really easy to do, but it doesn't give you context. If you've got the word RED, that's a sequence of letters: ERD doesn't mean the color red, DRE doesn't mean the color red, it has to be RED. Embedding, by contrast, gives features with similar influence on the data, whether they're sequences or words, similar values for specific features. You can one-hot encode sequence data, or you can embed it, and embedded sequence data carries more information. It's useful in natural language processing, named entity recognition, and text summarization, but people have used it in gene finding and protein structure prediction as well. Again, it works with text and words or letters, wherever the sequence of letters or words has important meaning. Here we can take nine terms, man, woman, boy, girl, prince, princess, queen, king, and monarch, and create a nine-bit encoding: man is 1 0 0 0 0 0 0 0 0, woman is 0 1 0 0 0 0 0 0 0, boy is the next one, and so on, so you have this big matrix that's just a whole bunch of ones and zeros. You would probably notice that a boy is similar to a man but younger, a prince is similar to a man because that's what you call male royalty, a king is higher up than a prince, and a monarch is similar to a king or queen or princess. So you can embed things: rather than having this nine-by-nine matrix, you can have a three-by-nine matrix. Instead of positions one through nine, you give the terms three features: femininity, youth, and royalty. Man is not feminine, but woman is. Man is not young, woman is not young, but a boy is and a girl is. A prince is royal, is also young, and is not feminine. A princess is feminine, is young, and is associated with royalty. So the encoded terms carry a richer amount of data. That is embedding, and it has context.
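Here are the two encodings from the slides as a small sketch; the exact embedding values are illustrative, following the femininity/youth/royalty features just described.

```python
import numpy as np

# One-hot encoding: each color becomes a 3-bit vector with a single 1.
colors = ["red", "blue", "green"]
one_hot = {c: np.eye(len(colors), dtype=int)[i]
           for i, c in enumerate(colors)}
print(one_hot["red"])    # [1 0 0]
print(one_hot["blue"])   # [0 1 0]
print(one_hot["green"])  # [0 0 1]

# Embedding: instead of a 9-bit one-hot per word, three meaningful
# features (femininity, youth, royalty) give similar words similar
# vectors -- the values below are made up to match the slide's idea.
embedding = {
    "man":      [0, 0, 0],
    "woman":    [1, 0, 0],
    "boy":      [0, 1, 0],
    "girl":     [1, 1, 0],
    "prince":   [0, 1, 1],
    "princess": [1, 1, 1],
    "king":     [0, 0, 1],
    "queen":    [1, 0, 1],
    "monarch":  [0.5, 0, 1],  # royalty of either gender
}
```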
This kind of embedding is quite useful in machine learning, but it requires some intelligence, and obviously there are different ways of embedding; we could have had three categories, two categories, or four categories. Machine learning also involves feature engineering: we manipulate the data, we scale, transform, and normalize it to make it more suitable for modeling. It improves model performance and reduces the impact of outlying data that's just too big or too small; in effect, it makes sure things are on the same scale. If you have some numbers around 0.001 and other numbers around 10 to the plus 28, you're not going to get good results. You have to scale these things: when values span different orders of magnitude or different ranges, you have to make sure the numeric data has been scaled or normalized. This is important because we take derivatives: in logistic regression, artificial neural nets, SVMs, wherever derivatives are involved, you have to have some kind of scaling. That allows the derivatives to converge, so we don't get not-a-numbers or zeros. We use normalizing, where things get rescaled and shifted so they range between zero and one, or we can standardize, which means things end up with a standard deviation of one. These are techniques that people have learned over the years that really make a difference in model performance. You can also do transformations: we can take a skewed distribution of data and change it to something like a Gaussian distribution, which is called log transformation. This is often used in data analysis and manipulation as well as in machine learning. Likewise, not all the data is relevant for training, so feature selection is something you can do, either automatically or manually, to choose the features that contribute most to the accuracy of the model. If you include irrelevant features like zodiac signs, it reduces the accuracy. You might have, say, seven features, do feature selection manually or automatically to get rid of four of them, and then your final model works with only three features. Sometimes people have too little data: with two points you can draw a straight line, but if you include the other eight or nine points, you realize that's not what's happening; it's not a straight line at all. So those are the things to know about transforming your data set; a lot of mistakes are made there, and in selecting features a lot of people don't do it, or don't use intelligent feature selection. Next, models. All of these are learners or models: decision trees, neural nets, Markov models, SVMs, GNNs, CNNs, belief nets. There isn't any way to know which one is best, so as I've said before, try a bunch of them; some do remarkably well, some do remarkably poorly. The decision tree is the simplest of all, and it's easy to implement. This is a decision tree of who survived on the Titanic. As a rule, it was women first, so a lot of men died. But among the males, children were chosen to be put into lifeboats, so if they were young enough, some survived; if they were older, they mostly died. And those with lots of brothers and sisters also had a better percentage of surviving. From this, you could roughly predict who would survive, and the computer can take the data and actually figure out how those decisions were made. Most of us know "women and children first" without even thinking about it.
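Backing up to the feature-engineering points for a moment, here is what normalizing, standardizing, and log-transforming look like in code, a minimal sketch with made-up values at wildly different magnitudes. (Note that, as the next point shows, decision trees don't need any of this.)

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature near 0.001, another near 1e28: wildly different scales.
X = np.array([[0.001, 2e28],
              [0.004, 5e28],
              [0.002, 9e28]])

# Normalize: rescale and shift each column into the 0-1 range.
print(MinMaxScaler().fit_transform(X))

# Standardize: each column gets mean 0 and standard deviation 1.
print(StandardScaler().fit_transform(X))

# Log-transform: pulls a right-skewed feature toward a Gaussian shape.
print(np.log10(X[:, 1]))
```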
Decision trees have branches, which we call edges, and they have nodes, the terminal ones of which we call leaves. Scaling is not needed for decision trees, which is really nice: it means you don't have to do all those transformations I was talking about, because no derivatives are required. A random forest is essentially a collection of decision trees; three decision trees are shown here, but there might be hundreds. It's an ensemble learning method, and random forests can do both classification and curve fitting or regression. It takes a collection of unconnected trees and essentially does prediction by committee. A lot of people, governments, and organizations work with committees: everyone comes up with slightly different decisions, but you take a majority vote or average those many decisions to come up with a final result, and often that's much better than a single decision tree. ANNs, artificial neural networks, try to simulate the brain. They're connections of nodes or units, artificial neurons, modeled loosely on brains, and ANNs can be used for classification and for curve fitting. Hidden Markov models are another type of model. They're probabilistic graphical models, and they can be used to model sequence data or events over time, Markovian events. They use emission and transition probabilities, as shown here for predicting whether the weather will lead to a dry, damp, or soggy soccer field depending on whether things are rainy, cloudy, or sunny; the various probabilities are shown as numbers around the hidden states. HMMs are something we used to teach. They're incredibly complicated, and they've been largely replaced by what are called LSTMs, long short-term memory neural nets, which seem to be much easier to implement and probably easier to understand. But HMMs are very good for predicting time trends and sequential events, things like predicting the weather, and they were also used to identify sequence motifs in bioinformatics for many years. Support vector machines: I've never understood why they're called that, because they aren't machines, they're algorithms. They use something called a kernel trick, which is another transformation trick that takes the data and finds a boundary in multiple dimensions, called a hyperplane, to classify things. It's very similar to something that was developed in the 1930s called discriminant analysis, or linear discriminant analysis. These algorithms can be used for classification and biomarker finding, and they can be used for regression as well, just like neural nets and random forests. So those are different models, and I could go on and on, but we don't have time. Testing and validation is another place where many people make mistakes. You're trying to avoid using too few parameters, which we call underfitting, and the most common mistake, overfitting, which is using too many. These are examples of underfitting, where you've got a whole collection of data points and you just draw a straight line; that's probably not a good fit. You can see the middle one, where it's just right. And the tendency for most machine learning practitioners, especially people new to the field, is to do the thing on the far right, which is to connect every point. Overfitting means that you're basically modeling noise, and it means your predictions will be terrible.
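Before going deeper into fitting, here is the single-tree-versus-forest comparison from above as a minimal sketch, on synthetic stand-in data (no real Titanic file is assumed here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 examples, 8 features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One interpretable tree; note that no feature scaling is needed.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Prediction by committee: 300 trees, majority vote.
forest = RandomForestClassifier(n_estimators=300,
                                random_state=0).fit(X_tr, y_tr)

print("single tree accuracy  :", tree.score(X_te, y_te))
print("random forest accuracy:", forest.score(X_te, y_te))
```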
Back to fitting: underfitting is rare, and it typically happens when you just don't have a lot of data, or you didn't do a lot of training. Overfitting, as I said, is a trap almost everyone falls into; it's one I see just about everyone, even the experts, make. The way you avoid it is to have external validation sets. You need to do cross-fold validation, and you can do leave-one-out or permutation testing. Typically people will take their data set, their training data, and divide it into two groups: about two thirds of the data is for training, and one third is going to be held out and used for testing. When you train, the model must never see what you call the test data, ever. You can then divide your training data into smaller parts, which can avoid a problem called training bias. Say your training data, now just two thirds of your total data, is 66 examples out of 100. We can divide it into three folds of 22 each and do three rounds of testing and training on that data: we train on two thirds of the 66, that's 44, and test on 22, but we divide it up in different ways each round. This is how we do our training. But we still have our outside or holdout set, the 33 examples that the model never saw here, which we're going to use to validate. This is a way of making sure, especially if you have enough data, that you're not getting too much of a bias. You can also do leave-one-out: instead of dividing into thirds, threefold or fivefold, we train on everything except one example and then repeat the process, taking a different one out each time. So instead of doing three rounds of training, you might do 66 rounds of training. It's maybe not the best method, but it is one that people have used. Permutation testing is another approach. Let's say you have some unlabeled data, this big cluster in the left corner, and then you've labeled it: the reds are one label and the blues are another. You run a classifier that groups things, and that classifier pulls them apart very nicely: you can see the red cluster on one side and the blue cluster on the other. So our classifier has done a nice job of separating this labeled data. We can then ask: is this a good classifier, is this model good? Well, we can relabel our data, which I've done just below in the middle; I call it permuted data, and you can see the reds are different, the blues are different. I run the same classifier on that permuted data and see what I get, and you can see that in this case it didn't separate them at all. I can do this over and over again and measure how well the separation is done each time. In this case, for the one in the upper right corner, the classification is very good, its separation score is excellent, and it's way above the norm; all the other runs just clustered together in that big group on the left side of the graph. The arrow marks our classifier, and we can say this classifier is really robust: it works for the data, and it's not just a chance result where everything gets classified the same way every time. So it's not overtrained, and it works.
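The three validation strategies just described, sketched with scikit-learn; the 66-example set mirrors the proportions in the walkthrough above, and the data itself is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     permutation_test_score)

# 66 examples, mirroring the two-thirds training split described above.
X, y = make_classification(n_samples=66, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# Three-fold cross-validation: train on 44, test on 22, three times over.
print(cross_val_score(model, X, y, cv=KFold(n_splits=3)).mean())

# Leave-one-out: 66 rounds, each holding out a single example.
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())

# Permutation test: shuffle the labels many times and re-score;
# a robust classifier should beat almost every permuted run.
score, perm_scores, p_value = permutation_test_score(
    model, X, y, n_permutations=100, random_state=0)
print(score, p_value)
```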
You can also assess how things are done when you're classifying: we talk about true positives and false positives, and we can measure sensitivity and specificity. This is called a confusion matrix. It can be two by two, three by three, four by four, but it's essentially measuring your performance: how many times you made mistakes, how many times you didn't, how many were called positive when they were actually negative and negative when they were actually positive. We can look at this with patients. Say there's a collection of 1,000 patients, 500 positive and 500 negative, with different scores: genetic scores, metabolite scores, protein scores. The ones in orange are the negatives, the ones in blue are the positives, and there's going to be some overlap. Some of them will have measurements that say they're negative when they're actually positive, and vice versa, and you can see those regions marked at the bottom as false positives and false negatives. From the overlap of those two distributions we can calculate our performance, our sensitivity and specificity, SN and SP. And we can use something called a receiver operating characteristic curve to measure that performance, the true positive rate against the false positive rate, for classifiers like a binary classifier. So here's our distribution, the red and the blue. We cut them at different thresholds: at the far one, I guess it's the purple one, there's none of the red; at the far left, the brown one, there's none of the blue; and in between there are mixes of both populations. Based on these colored dots and lines, we can calculate how many times we had a false positive and how many times we had a true positive, and plot them on this curve. The areas we calculated under the red and blue distributions correspond to the x and y axes of the graph, and the line drawn through those points is a ROC curve; each point on the line corresponds to one of the colored cutoffs. There's going to be a trade-off: some cutoffs make things more specific, some make things more sensitive. This one is low sensitivity and high specificity; this one is high sensitivity but low specificity; and the one at the corner, the bend zone, is the best. We can calculate the area under these ROC curves, and that gives us a performance measure. A straight diagonal line is terrible: that's a random predictor. A ROC curve with 100% or 90% area under the curve is a very good predictor; 70% isn't anything to publish or write home about. But if your testing and validation show that your model is doing really well, or sufficiently well, then you can publish it and tell everyone else: come use it. We've done this for a few things. We've done it for predicting MS spectra using probabilistic models and artificial neural nets: we tested it, validated it, published it, and put it on a web server, and now lots of people can use that tool to predict mass spectra. We've done it for protein secondary structure: we validated it, tested it, performed all the assessments, then put it out on a web server.
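Pulling the confusion-matrix and ROC discussion together in code, a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# A stand-in for the 1000-patient example: binary labels plus a score.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Confusion matrix at a 0.5 cutoff: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_te, scores > 0.5).ravel()
print("sensitivity:", tp / (tp + fn), " specificity:", tn / (tn + fp))

# ROC curve: true vs false positive rate at every possible cutoff;
# the area under it summarizes performance (0.5 random, 1.0 perfect).
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```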
So we've done this kind of release for lots of different examples, and many other people have done the same. This is what you do when you finish your predictor: you release it on GitHub, you publish it as a web server, or you use it internally and sell it to other people and make money; that's how a lot of people in the machine learning business work. Over the next two days we're going to show you two types of machine learning models, a decision tree and an artificial neural net, and apply them to general classification, secondary structure prediction, and gene finding. These are kind of toy problems, but they should give you a better understanding of how these things work. We're going to deep dive into the algorithms, using Python in Google Colab. And then, after climbing Everest by the hard route, we're going to show you how you can helicopter up to Everest using Keras and scikit-learn, coding the same thing much more quickly with those tools; that'll be done tomorrow. So this is the end of the module. I think we'll just make sure that everyone has been able to get onto Google Colab. These are slides that all of you received; if any of you haven't gone through them, haven't been able to install things, or haven't even created a Google account, you should follow these slides or use the break we have to work through this. So these are all things you've got, and I'm not going to go through them again: the different libraries that we'll be using, also the R material, and where to get the student pages and the course repository. Hopefully everyone can get those. Because of libraries like scikit-learn, Keras, TensorFlow, Azure, PyTorch, and Weka, a lot of the things that are going to look pretty difficult today can be done much more easily, and we'll see how to do that tomorrow. So I've just tried to make sure that you have a basic understanding of machine learning: that you understand how you can use it for pattern finding, fitting, prediction, and biomarker identification, with many different models. It's used in many things you use today, and it's certainly becoming much more accessible because of these tools, as we'll learn tomorrow. We've talked about the history and the applications, and about how the data you can access, and the way you scope and frame your problem, are really critical to making it work for you. I've gone over time, but I also want to emphasize that machine learning isn't for everything. In fact, many common multivariate statistical techniques, like principal component analysis, partial least squares discriminant analysis, and logistic regression, are all technically machine learning methods. They're very old, but they work really well, and there are really nice programs. MetaboAnalyst is one where you can do biomarker identification pretty much automatically with just about any data set. You don't have to know machine learning, you don't have to know programming; you just upload your data set and MetaboAnalyst's biomarker module can do your biomarker identification using either traditional statistical methods or these multivariate statistical machine learning methods. It also offers support vector machine models, which are closer to modern machine learning ones.
But those are examples where you can just go to an online tool and get your biomarkers discovered from pretty complex data: it can be genetic data, proteomic data, metabolomic data, transcriptomic data, or clinical data, and MetaboAnalyst handles it all, determining or identifying your biomarkers. So you don't need to do any coding. But I think a lot of you are here to learn a little bit more about how to do just that sort of thing yourself, so you can customize your models.