So we're in the home stretch here. After the usual slides about Creative Commons, we're talking about coding and information extraction with ChatGPT. I think most of you, or many of you, have heard of it, and most of you have tried ChatGPT. Just in terms of context, this is primarily a lecture. There's a short lab, which is why we asked some of you to try and get accounts for ChatGPT. And at the end of this, which will be about an hour or an hour and 15 minutes, we'll wrap up. There's a survey that we do at the end of these workshops to get feedback, advice, and comments about what we can do better the next time it's offered. As with everything, we start off with learning objectives: we're going to be learning about ChatGPT and about other chatbot systems or large language models, and we'll also learn about alternatives to ChatGPT. The main point, really, is to show how you can use ChatGPT to facilitate your work in bioinformatics. So we're going to show you how you can do some coding with ChatGPT, and we're also going to talk a little bit about information extraction and something called in-context learning, and how that can be used to do information extraction. Probably everyone's heard about ChatGPT because it's been in the news a lot. There's a Hollywood strike partly because of it right now. There's been lots of debate, there are new laws starting up, people have signed open letters, and Geoffrey Hinton resigned from Google over concerns about this technology. ChatGPT was developed by a company called OpenAI. The company started about seven or eight years ago as a not-for-profit foundation in San Francisco. But over time they found that a lot of money was being invested without any coming back, and investors wanted to put money in under a for-profit model. So OpenAI converted to a capped for-profit model, with Microsoft as the major investor.
Among the people who started OpenAI was Elon Musk, but he left, I think in 2018, so he's no longer part of it. It was probably because of the shift to the for-profit model that OpenAI started actively working on ChatGPT. It hit the news at the end of November 2022 with a release called version 3.5. And I've mentioned this before: about $700 million to develop it, about 350 people, starting around 2017 or 2018, roughly when they became for-profit. They used a model called a generative pre-trained transformer, or GPT. The transformer architecture is shown on the right here, in the upper figure. It uses feed-forward layers, which you've learned about with neural nets. It uses something called an attention mechanism, which determines which parts of the input the model should focus on, and this is something that can be learned. It uses embeddings, which are standard in text processing, as opposed to, say, the one-hot encoding we've talked about. There's additional positional encoding that's used. You can see some of the activation functions there: a softmax function, a linear function, and various other thresholds. It's a complicated architecture, more complex, obviously, than a simple neural net. Large language models, of which ChatGPT is one, use huge amounts of data, billions of words. Training ChatGPT required massive computers with specialized GPUs, or graphics processing units, running for months to complete the training. As I said earlier, as a generative model, what it does is generate text, word by word. You see models like this with the auto-suggest or auto-fill on your cell phone when you're texting: it takes the first letter or two and suggests a word.
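Stepping back to the attention mechanism in that transformer diagram for a moment: its core operation can be sketched in a few lines. This is an illustrative toy with made-up numbers, not ChatGPT's actual implementation, which uses learned projection matrices and many stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # each row of the weight matrix says how much one token "focuses"
    # on every other token; rows sum to 1 because of the softmax
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# three made-up token embeddings of dimension 4
x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])

out, w = attention(x, x, x)  # self-attention: queries, keys, values all = x
print(w)
```

The weight matrix `w` is the "how long something should focus" part: each of its rows is a probability distribution over the input tokens.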
If you have auto-fill or auto-suggest activated in your email system, it can also suggest certain phrases or words based on how frequently you use them. Rather than just a single word or a few words, though, ChatGPT can generate three or four thousand words at a time. Now, large language models can process text, some can handle voice, and some can handle images. In terms of what they can do: they can follow instructions, recognize objects, caption images, and do question answering. They can do something called sentiment analysis, telling whether a text expresses a positive or a negative feeling, and they can do information extraction. These are all general applications of large language models, and a lot of those things represent what we would call intelligence. So, in June of 2016, OpenAI started working on generative models, and that was their open-source work; they later described GPT-2, and I think that was mostly open-sourced, along with other models. In January 2022 they created something called InstructGPT, which is a model able to follow instructions really well. This has been really important, because when you ask ChatGPT to do something, it has to understand what you're asking. About a year later they released the full version of ChatGPT, based on GPT-3.5, and made it available as a web service. In February of this year they introduced ChatGPT Plus as a subscription, which basically gives you enhanced access to ChatGPT. Around the same time, Bing Chat, which is powered by the same GPT technology, also came out from Microsoft.
The API came out in March, and GPT-4 was formally released with ChatGPT Plus; a lot of people had actually been playing with it since December or January, and its performance is quite a bit better than version 3.5. A whole bunch of plugins have been introduced, and those keep piling up. A macOS app has been introduced for ChatGPT. Bing now powers ChatGPT's web browsing, and more prompt collections have been introduced so people can have smarter interactions with ChatGPT. So a lot of stuff has happened in the last year, there's more going on, and it's very hard to keep up. A large language model is something that has billions of weights in it, anywhere from a billion or two up to hundreds of billions, and those weights have been determined by training on tens of billions, in some cases they think trillions, of words for GPT-4, scraped from the internet. That includes documents, databases, Wikipedia, abstracts from PubMed, and also text messages that people have been sending, which give more information about how people talk and communicate. And of course, through that it's acquiring knowledge, or facts. The critical enabling technology is GPUs, which you could also call AI accelerators; they're able to process huge amounts of text. As I said, the concept of a large language model is to take an input text, generate a first word, and then try to predict the next word, or next token, based on the previous words or on the instructions it's been given. If you wanted to train a 12-billion-parameter large language model, it would require around 72,000 GPU hours. A GPU operates probably 100 to 1,000 times faster than a CPU for certain tasks, so this gives you a sense of the scale and cost of the computation. Now, in terms of the details: it uses what's called probabilistic tokenization, and tokens are about four letters long.
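The basic idea of tokenization, text in, a series of numerical tokens out, can be sketched with a toy greedy tokenizer. The vocabulary and the longest-match rule here are made up purely for illustration; the real GPT tokenizer learns its pieces with byte-pair encoding over a vocabulary of tens of thousands of entries.

```python
# Toy sub-word tokenizer: known pieces, matched greedily longest-first.
VOCAB = {"token": 0, "izer": 1, "ize": 2, "s": 3, "text": 4, ":": 5, " ": 6}

def tokenize(text):
    ids = []
    i = 0
    while i < len(text):
        # try the longest vocabulary piece that matches at position i
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            ids.append(-1)  # character not covered by the toy vocabulary
            i += 1
    return ids

print(tokenize("tokenizer: text"))  # "token" + "izer" + ":" + " " + "text"
```

Notice how "tokenizer" is split into two sub-word pieces, much like the syllable-style splitting described above.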
You can almost think of tokens as syllables; these pieces are sometimes called n-grams, and each one is assigned a number. You can see this example of some text, "tokenizer: text → series of numerical tokens". If you tokenize that sentence, "token" becomes one token, "izer" is another pseudo-syllable, the colon is another symbol, "text" is another, the arrow another, then "series of", and so on; numbers and quotes create their own tokens, and rarer strings get broken down further into pieces like "t", "o", "k", "en". What these GPT systems do is learn the probabilities of these tokens by training and optimizing, just like we've talked about: you have batches and epochs, you take your training data, process it, and incrementally improve. In effect, the model asks: given "tokenizer: text →", should I write "series of numerical tokens", or which words, or which tokens, should follow afterwards? Some of it is based on context, some on a prompt, some on the data that's been collected. And if you think about it, that's very much how we talk and how we think: we are generative word generators, driven either by instructions or by internal cues about what we need to say. Now, you can only get so far by predicting which words come next, so there's a process called fine-tuning, which is further training: after the general training, you do fine-tuning, where you go through corrections or specialize on certain texts or subject areas. A lot of the fine-tuning for ChatGPT was done through reinforcement learning; formally, it's called reinforcement learning from human feedback, or RLHF.
That's the same reinforcement learning concept we talked about in the first lecture, about how cars learn to drive. It's particularly good for robotic performance, but it's also good for getting things to behave more like humans. So OpenAI actually hired lots of people to manually correct and reinforce responses. The reason they released ChatGPT on the web was also to get even more feedback, to build an even better model; when they released it in November, that feedback was used, I think, to refine and improve GPT-3.5 toward GPT-4. You can also get to the point where it's training itself; this is what happened with chess- and checkers-playing programs, which would play against themselves. And you can do self-instruction to bootstrap the quality of the responses. One thing that was surprising, and maybe unexpected, happened as these large language models grew from 100 million to a billion to 10 billion parameters; in terms of parameter count, the biggest ones are now somewhere around a trillion. ChatGPT is around 175 billion, PaLM, I think that's Google's, and most of the others are well over 100 billion. You can see in the lower table the parameter sizes of some of these models. For some of them you can access the source code, some are available through an API, some are not accessible at all, and then there are the commercial ones. The original GPT came out with about 100 million parameters in 2018; GPT-2 in 2019 was about 1.5 billion parameters. Then there was DialoGPT in 2020, GPT-Neo in 2021, and InstructGPT fits in somewhere around there; GPT-3 had about 175 billion parameters, and then came ChatGPT with GPT-3.5. When things get up to around 100 billion parameters, they start to acquire knowledge about syntax, about semantics, and about ontologies, the things that are inherent in how humans write language and communicate.
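All of this rests on the next-token objective described earlier: predict which token follows the ones seen so far. A toy bigram model makes the idea concrete. It just counts which word follows which in a tiny made-up corpus; real LLMs learn these probabilities with billions of parameters, but the training objective is the same.

```python
from collections import Counter, defaultdict

# Count, for every word in the corpus, which word follows it and how often.
corpus = "the cat sat on the mat and the cat slept on the sofa".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # the most common continuation seen during "training"
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

A large language model does the same kind of prediction, but conditioned on thousands of preceding tokens rather than just one.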
So there seems to be this threshold where the thing gets, you know, human-like, or smart. And again, it's not fully understood; people are still trying to parse out why and how this happened. These are some other graphs showing some of the early models and how many parameters were used. There was an early Google model of around 140 million parameters, GPT-2 was 1.5 billion, LLaMA 65 billion, GPT-3 175 billion, then Gopher, PaLM, and PaLM-E, and for GPT-4 they think maybe close to a trillion parameters. So with growth in size, somewhere between about 150 and 300 billion parameters, things start to get intelligent. We talked about some of the applications I showed on the earlier slide: information extraction, instruction following, question answering, summarization. Some of the other things people have used ChatGPT for: writing cover letters, planning trips, providing professional advice; there's a GitHub repository with a whole bunch of prompts. Also grammar checking, code debugging, coding advice, adding code comments, menu planning, recipe writing, writing poetry, writing music lyrics, writing essays, writing stories, and writing scientific papers. So at this stage I'm going to stop a little bit and go to you guys, and ask whether, and for what, some of you have been able to use ChatGPT or similar tools. Does anyone want to volunteer what they've had fun doing or trying with ChatGPT?

Yeah, so I've been using ChatGPT pretty much since it came out. What I've been primarily using it for is to develop an application, a Python application. Actually, before I started using ChatGPT I didn't know much Python, but I've noticed that if you ask it specific questions, the same way you'd search on Stack Overflow, and if you chain logic, so one request is referenced in the next request, it gives really, really good results.
Yeah, that's a very good point and a really important application. Kylie?

I just wrote it in the chat, but yeah, my personal favorite is actually trip planning. It's great, it's saved me tons of time. I say I'm going to this city for three days, what should I do? And it gives me a three-day itinerary. It's great. Sometimes writing, too: if I'm really stuck, it's often wrong, or not exactly what I want to say, but it's a really great place to start, and it kind of helps me get through those writing blocks. Coding help and stuff too. Yes, it's been really, really helpful.

Anyone else with comments?

So, um, I had a problem for a while. I work on dog genetics, dog cancer, and a lot of the variants in humans are really well annotated; they tell you what kind of variant it is, whether it's a hotspot mutation or something like that. So we always had to BLAST our sequence against the human sequence to find the corresponding location, because it might be a very conserved region, and if it changes to the same amino acid, or something similar, then we can say that, okay, this could be a hotspot mutation in dogs as well. It's a great problem for us because we don't have a huge amount of data; we just want to compare what the changes signify in humans. So instead of trying to write the code myself, and I wasn't sure how to do that, I asked ChatGPT. Initially it was not able to help me, but then, like I said, I asked a very specific question, because my idea was: why don't I create a BLAST database using only these 900 or so cancer-related human proteins, BLAST all my dog data against that, and then pull out the information. So that's what I asked ChatGPT: write me code so that, given amino acid A at position 58 in dog, I can find the position of that residue in humans.
So that's what I asked, and it initially gave me Python code, and I said, look, I don't know how to work with Python code because I'm not used to it, so write the same thing in R, and it did. It still couldn't give me the exact answer, but within a week I was able to make it happen, and it works pretty well now. So it was very helpful in that way.

Good, that's very helpful to know. Page?

I've used ChatGPT a lot to write me protocols, for things like transferring data from one type of instrument to another. I usually have a CSV export file, and I try to get ChatGPT to write me a protocol, a step-by-step process, for how to upload it into, say, a SpectraMax plate reader. It's very helpful for that kind of stuff, where the protocols aren't on Google anywhere. It didn't work exactly, but it gives you a good pathway to try.

Yeah, that's cool.

Yes, I tried using ChatGPT for some Python coding, and I found it kept giving me a lot of deprecated code, which drove me up the wall. So I ended up going back to Stack Overflow and checking this and that; I was a tad thrown off. But it's probably improved and probably gives better code now, so I wouldn't waste that time. I haven't had a chance to sign up for GPT-4 yet, but it's on my to-do list.

Yeah, I've heard that from a number of people, about the tendency to use deprecated code or functions. And there are different chatbots, both version 4 and also other providers, that seem to specialize more in code and produce higher-quality code. Okay, so again, it's nice to hear people's thoughts. I was at a dinner once with a musician, and he was fascinated with ChatGPT; he was showing how you could get ChatGPT to write actually really good music lyrics, in various musical styles, similar to various musical artists. And as a musician, he was impressed.
I was equally impressed. So I'm going to switch back to our screen and continue with the discussion. We've also looked at how we can use ChatGPT for trip planning. Other applications are language translation and web search. Role playing has been another one, plus topic summarization, copy for ads, brainstorming, virtual assistant tasks, QA or customer service, and interpreting queries for music and art generation; there's DALL-E, which some of you may have heard of, Midjourney, and MusicLM. People have used it to solve riddles, puzzles, and crosswords, write jokes, write chords for lyrics, make gift recommendations, make tables as well as text, classify information, and generate product descriptions. These are maybe more obscure, but they're ones where it's been successfully used. Now, there's a lot of hype, and these are performance tests that GPT-3.5, GPT-4, and GPT-4 with no vision were able to do. AP stands for Advanced Placement; these are high-school, 12th-grade exams, considered the peak of a person's school knowledge, and people who got AP credits usually got first-year university credit. ChatGPT was able to take these tests. It can also take the SAT, the test commonly used in the US for college admission, and it does very well. It knows art history and psychology, US government, US history, SAT math, and the Graduate Record Examination. With GPT-3.5 it was hovering around the 65th percentile; with GPT-4 it's near the top. The writing component is not much better, though. The LSAT, that's the exam to get into law school; GPT-4 gets you up to around the 80th percentile. There are other exams, in chemistry and quantitative work, and you can see where it starts failing badly: things like AP English Literature and English Language. This is where it tends to fail, on open-ended questions, the typical "write an essay on the meaning of life".
That might be something you'd get on an AP English Language or AP English Literature exam: talk about why Shakespeare was important. And there you get this thing called hallucination, and it does pretty badly. Some of the things people have tried with ChatGPT, especially GPT-4: it passed not only the LSAT to get into law school, it passed the New York State bar exam. On the SAT it would have scored around the 90th percentile, probably good enough to get into universities like Berkeley or elsewhere. It did very well on the GRE, and scored in the 99th percentile on the US Biology Olympiad exam. The high-school AP exams were passed by GPT-4. It can also be a sommelier, a wine taster; it obviously can't taste or drink wine, but in terms of knowledge about wine, it passed the introductory and advanced certified sommelier exams. It passed the Wharton MBA exam, and it also passed the US Medical Licensing Examination, which would be sufficient to become a doctor. These are pretty impressive achievements, and it can do this 24/7, any time of day, without studying. There are limitations with ChatGPT, though. Its knowledge base is only current to about 2021. Weirdly, it doesn't look things up in places like Wikipedia or a dictionary. Ask "who is Donald Trump?" and it might write some kind of hallucinatory description, whereas it could just look it up on Wikipedia. Or ask for the definition of onomatopoeia; again, it won't consult a dictionary. More advanced systems, like ChatSonic, will use things like knowledge graphs and lookups to get answers, and can access the internet. ChatGPT can't tell the time, can't tell dates, and can't do simple math, the sort of things you can easily get from Google. And it fails at these open-ended questions, like "why was Shakespeare important?", which is why, I guess, it fails the AP exams in English Literature.
Another complaint is that it doesn't provide citations or references for its statements or sources, and, as mentioned, it will hallucinate; the term means generating nonsensical or false statements. Some of you could try this right now: have it write a biography with your name in it and see what it does. Unless you're really, really famous, it'll make for some pretty interesting stories about you. So if anyone's able to do that, maybe in the next minute or two they can read us their biography as generated by ChatGPT. If you've never tried ChatGPT, we've given you some instructions: go to chat.openai.com, sign up or log in, enter your email and a password, and that's how you start accessing it. In my case I don't have ChatGPT Plus, so I can just access GPT-3.5. You can see where I've typed in a query; this is a question-answering example: explain what hidden Markov models are. There are other examples on the start page, like "explain quantum computing in simple terms" or creative ideas for a 10-year-old's birthday. It allows you to follow up with corrections, but, as I said, it can hallucinate. So maybe I'll just stop here: did anyone try asking ChatGPT to write a biography about themselves?

I did, yes.

Would you read it to us, or is it too personal?

No, no, actually it first gave me an error, saying I wasn't famous before 2021, so it didn't have anything. Then I revised it to include "Doctor" with my name, and it generated that I'm a famous neuroscientist who won a Nobel Prize in neurophysiology. So thank you. And at the bottom it said that this is fictional and not based on a real person. It's kind of fun.

Okay, that's interesting. Anyway, it is a failing, and I know a number of people have tried it.
I don't know, Nia, did you ever get a biography of yourself written?

I'm working on that right now. It also doesn't think I'm famous, but I'm hoping it will similarly manifest something positive for me.

Okay. I see you have a comment?

Yeah, I was going to say: I asked for a biography about you, David, and it gave a beautiful biography.

Oh, okay.

It writes about you very well, because you are famous.

I don't know, I'll have to check to see if it's factual; I haven't done that, so I'm glad you tried it. It's interesting. Anyway, it will remember if people have modified things, so if you've asked it to write a biography and said, you know, correct this or change that, it will remember, and the next time you ask it to write a biography, it'll be a little better. Okay, back to the slides. This is the output after asking it to write about hidden Markov models, and it wrote and wrote and wrote. I could have said, you know, explain what hidden Markov models are in less than 2,000 words. But it's quite good, actually, in terms of this description, and it's not taken from Wikipedia. I find a lot of the Wikipedia descriptions in computing or math are written by someone who likes proofs and lemmas, and the versions produced by ChatGPT are much more readable and often much more understandable. In terms of ChatGPT Plus: it gives you access to GPT-3.5 and GPT-4, and you can interact and retrieve. It costs about $20 USD a month. There's a limit: you can interact with GPT-4 about 50 times every three hours, or about 400 times a day if you're working 24 hours; with GPT-3.5 you can hit it as many times as you want. There's also an app, available for iOS and Android, but it's not available everywhere; the US usually gets lead access for a couple of months.
You can put ChatGPT into your applications, your products, and your services, so you can have conversational interactions; this is how you build a customized chatbot or customer-service tool. You access the API through OpenAI: you get an API key, follow the documentation, and build it into your specific application. Our lab has purchased API access. Has anyone else been able to get an API key? I guess you'll have to put your hands up, or put something in the chat, so someone can tell me how many people have the API.

How many people put their hand up? I see none here right now.

Okay. But we were chatting in my group, and I think one person had used it; so Keong has it. Anyway, pricing depends on the model and the number of tokens you use, because there's a cost per token. We talked about this before: tokens are small words or syllables, about four characters long, so 1,000 tokens is roughly 750 words. For the turbo model, input is about 0.15 cents per 1,000 tokens and output about 0.2 cents. It's somewhat more with GPT-4: typically 3 cents per 1,000 input tokens and 6 cents per 1,000 output tokens. Now, GPT-3 also has coding capabilities. Remember, 3.5 is the web version, 4 is the one that costs money, and with 3 you can do this thing called fine-tuning. It's not relearning the original model; it's customizing the model to do the specific things you want it to do. For fine-tuning, you can either run it on the platforms, the GPUs that OpenAI has, or you can run similar models on separate platforms. With Davinci, the costs, instead of being a fraction of a cent, get to be more expensive.
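To get a feel for how the per-token pricing adds up, here is a back-of-the-envelope cost estimate. The prices are the per-1,000-token figures quoted above; they change over time, so treat the numbers as illustrative, not current.

```python
# Rough API cost estimate: (input, output) prices in US dollars per 1,000 tokens.
PRICES_PER_1K = {
    "gpt-3.5-turbo": (0.0015, 0.002),  # 0.15 and 0.2 cents
    "gpt-4":         (0.03,   0.06),   # 3 and 6 cents
}
TOKENS_PER_WORD = 1000 / 750           # ~1,000 tokens per 750 words

def estimate_cost(model, input_words, output_words):
    in_price, out_price = PRICES_PER_1K[model]
    in_tokens = input_words * TOKENS_PER_WORD
    out_tokens = output_words * TOKENS_PER_WORD
    return (in_tokens * in_price + out_tokens * out_price) / 1000

# a 750-word prompt with a 750-word reply
print(round(estimate_cost("gpt-3.5-turbo", 750, 750), 4))
print(round(estimate_cost("gpt-4", 750, 750), 4))
```

At these rates, a full prompt-plus-reply exchange costs well under a cent on the turbo model, but roughly 25 times more on GPT-4, which is why the costs mount quickly at scale.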
Fine-tuning is 3 cents per 1,000 tokens to train, and 12 cents per 1,000 tokens to use the fine-tuned model. It's the most expensive thing to do on the OpenAI platform, but it's often needed when companies want models with specific knowledge. And when you start calculating the cost, if you're looking at thousands of words, hundreds of thousands of words, millions of words, the costs start to mount. So there are a number of alternatives: Microsoft Bing, Bard, Claude 2, and LLaMA 2. How many people have heard of ChatSonic? Maybe put your hands up and the TAs can count. Anyone with a hand up for ChatSonic? Not yet. Okay. So this is one that in many respects is probably better than ChatGPT, and it's actually built on ChatGPT, but it uses knowledge graphs, it has citations, it has access to the internet, you can use audio, and it has a built-in image generator. Like ChatGPT, it's not open source, and you can't access it quite as many times as you might want to, compared to GPT-3.5, but given its capabilities, most people probably wouldn't want to use it more than 25 times a day. So in many respects I think ChatSonic is head and shoulders better than ChatGPT. Then there's Bard. It's built on a very large language model, PaLM 2, and it uses knowledge graphs, which ChatGPT doesn't. It's very fast. It has different behavior: it's trained and fine-tuned by different people, so it's just a different entity. Like ChatGPT and ChatSonic, Bard's not open source, and unfortunately it doesn't provide citations, which is a limitation. Then there's Claude, the Anthropic one, which has different large language model technology from most of the others. It's very good at writing code; a lot of you mentioned coding applications, and this one seems to be better than ChatGPT. It does text extraction, and it's fast. It's not available in Canada yet, and it's not open source, but I'm sure availability in Canada is going to happen very soon.
Bing: you can get it immediately, it's free, it performs web queries, and it gives you visual answers. It's powered by GPT-4, so if you don't want to pay the $20 to get GPT-4, just use Bing. It gives citations, and it's very accurate, but it's slow, and not open source.

Very positive. As you mentioned, one of the issues with ChatGPT is that it doesn't have access to the internet or to current information, and it doesn't tell you where it got the answers to your questions. Bing lets you search the internet and get current information, and it gives you references, so you can go to the references and see whether it hallucinated the answer or the answer is real. There's a way to verify its replies. Unfortunately, because it's popular, you sometimes lose the connection to the server, and it limits how much you can ask it per day and per session, but generally it's really great.

Yeah, I think a lot of people complain about the slowness, or the fact that you lose connections, but yeah, there are positives. The other one is LLaMA 2, produced by Facebook/Meta. People have commented that it's really good at reasoning and instruction following, with good coding abilities. You can get models up to the 70-billion-parameter version, trained on about two trillion tokens. It's freely available, and, more importantly, it's the only one of these that is open source. It's probably better than GPT-3, maybe not quite as good as GPT-3.5 or the other commercial ones on human performance tests. But it's evolving quickly because of its open-source nature; this slide was written a couple of weeks ago, so by now it may be as good. The fact that you can download and run it locally is nice, but to do it properly you have to run it on what are called A100 GPUs, the cream of the crop, at about $15,000 each.
You usually need two or three of them, and at least 80 gigabytes of memory, to run the 70-billion-parameter model. These are not computers most of you have around at home, I don't think, but you can also run it in the cloud, and I think providers offer it that way. So those are some options; that's a short list, and there are probably more than a dozen appearing now. We're choosing ChatGPT not because it's the best, it's probably not the best, but because it's so accessible and so fast; I think it's a good one to play with. One of the applications, of course, is coding with ChatGPT. You can start a session: click New Chat, or type instructions into the query box. Here's an example query: "write a Python program for translating DNA". That's non-specific; it's an example of a very open-ended query. You could be more specific, say: "write a Python program, using non-deprecated Python code, for translating DNA in six reading frames, where the query DNA sequence can be entered through a prompt saying 'enter DNA sequence here'". That's a more specific query, and by giving more specificity in the query, or the instructions, you're more likely to get a complete and correct answer. This has been the problem: a lot of people don't know about the importance of what's called query engineering. So here's the result, and it's kind of cute: it says "Certainly!", so it's happy to do it, explains that DNA translation is the process of converting a DNA sequence into the corresponding protein sequence using the genetic code, and says here's a simple Python program to perform that translation. So it's given you an answer; it just spits it out.
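Before walking through the generated code on the slide, here is a minimal sketch of what such a translation program typically looks like. This is an illustrative version, not ChatGPT's exact output: define the genetic code, walk the sequence three bases at a time, stop at the first stop codon.

```python
# Standard genetic code: codon -> one-letter amino acid, "*" marks stop codons.
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(dna):
    # normalize the input, then read codon by codon in frame 1
    dna = dna.upper().replace(" ", "").replace("\n", "")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for unrecognized codons
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTAA"))  # Met-Ala, then the TAA stop codon
```

Tedious but mechanical, which is exactly the part ChatGPT is good at producing for you.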
And then you can see that it's produced all 64 codons and identified the stop and start codons. And then it gives you instructions: replace the DNA sequence with your desired DNA sequence, and the code will output the corresponding protein sequence. It notes that this example assumes the DNA sequence is a valid sequence whose length is divisible by three. So that's from ChatGPT-3.5: it tells you what it's going to do and explains it. At this stage you'll notice it even has comments, so there's an example DNA sequence, then it says perform DNA translation, define the genetic code. So from a quick view it looks good. If you know about Google Colab, you can cut and paste that code into Google Colab. And as some of you have noted, not everything that's coded by ChatGPT is correct. This is something that Vasu asked it to do, and when he put in this DNA sequence, which was in fact divisible by three, somehow it got confused. So the point here is that ChatGPT code is generally good, but it's not always perfect, and that often means that people need to tweak the code a little bit. So it failed here. With a little bit of debugging (here's the original code, and here's the code that Vasu implemented), about four or five lines have changed, and now it's able to produce the protein sequence more appropriately, and when you redo the query, it does the correct translation. So it's, we'll say, maybe 95% correct, and it did the tedious part by getting the codons. It understood what it was supposed to do; certainly the ability to read input was correct, and it got you a good way there. Any questions or comments about that? On my side, just in addition to what you just said about query engineering: ChatGPT gets better with each prompt and each query, so depending on the way we prompt it, it's supposed to get better, and in many cases it does get better.
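To make the example concrete, here is a minimal sketch of the kind of translation code being described; this is a hypothetical reconstruction, not the exact code ChatGPT produced in the demo. The codon table is built from the standard genetic code, ordered T, C, A, G at each codon position.

```python
# Standard genetic code, with codons ordered TCAG at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA sequence (reading frame 1) into protein; '*' marks stops."""
    dna = dna.upper().replace("U", "T")  # accept RNA input too
    protein = []
    # Trim any trailing partial codon so the length issue in the demo can't bite.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))  # 'X' = ambiguous base
    return "".join(protein)

print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # MAIVMGR*KGAR*
```

Note the guard for sequences whose length is not divisible by three; that is exactly the sort of edge case the generated code missed and that a few lines of debugging fix.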
And also, just a fun fact: there are now real courses going on on query engineering. People are taking those courses, not just for ChatGPT but for AI prompt engineering in general, and people are getting paid to become prompt engineers as well. Yeah, I'll just go back to one thing that people could take a look at, a long way back: you'll see this slide providing professional advice. If you take this GitHub website, you can either cut and paste that or type it out, or just search for "AI professional prompts", and you'll see what a really good prompt engineering example can do. It's just very explicit instructions, and you can take those explicit expressions and see what they do. There are other sites that also have good prompts. So that was translating DNA; it's sort of an ORF finder, if you want. Then let's try to write a Python program to not only translate but to do ORF finding in DNA sequences. It replies: here's a Python code that finds open reading frames in a given DNA sequence; replace the DNA sequence with your sequence of interest; the code will find ORFs in all three reading frames and print the identified ORFs and their indices; this may not cover all cases, and you might need to refine or extend the code based on your requirements. So we can take that code into Google Colab, and it actually does find the ORFs. They're very short ones, but it's a short sequence. In this case it's probably more efficient than the original ORF code that we wrote, but it's also a little more limited. This is an example where we got it on the first hit. And this is again a query that was submitted; I presume it was your first shot at it, or had you refined the query a few times? It was the first shot. So again, since you guys have access to ChatGPT, maybe you can try asking it to write up some code.
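For reference, here is a sketch of what an ORF finder along the lines described above might look like; again this is a hypothetical reconstruction, not the code ChatGPT actually returned, and it only scans the three forward reading frames.

```python
def find_orfs(dna, min_len=6):
    """Find ORFs (ATG .. stop codon) in the three forward reading frames.

    Returns a list of (frame, start_index, orf_sequence) tuples for ORFs
    at least min_len nucleotides long.
    """
    dna = dna.upper()
    stops = ("TAA", "TAG", "TGA")
    orfs = []
    for frame in range(3):
        i = frame
        while i <= len(dna) - 3:
            if dna[i:i + 3] == "ATG":
                # Walk codon by codon until we hit an in-frame stop.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in stops:
                        orf = dna[i:j + 3]
                        if len(orf) >= min_len:
                            orfs.append((frame, i, orf))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs

print(find_orfs("ATGAAATAG"))  # one short ORF in frame 0
```

Like the generated code in the demo, a sketch like this is limited: it ignores the reverse strand and nested start codons, which is the sort of refinement you'd ask for in a follow-up prompt.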
Now, it shouldn't be too complicated; something like the gene finder code in R is probably too complicated. But you could ask it to write, you know, a neural net model that is also an ORF finder, or something like that. I also want to point out, if you're not keeping an eye on the Slack, you're missing out: Mark has been putting in some really interesting sort of auxiliary information, extra credit for reading. It's really good stuff. Okay. So what you'll find, and you can test this whether you want to do it in R or Python, is that it won't always generate perfect code; you do need to do a little bit of debugging. Version 4, if you're going to pay for it, is better than version 3.5, or you can use Bing, which I guess can also generate code. But remember the other ones that I mentioned, like Bard and LLaMA; those are apparently also probably better than ChatGPT-3.5 for code generation. Certainly, by improving and refining your query, or starting right off with a good query, it will start improving or enhancing the results. This is something that Mark was doing in trying to revise the R module for the gene predictor. It goes back to the point about optimum performance: the more complete, the more precise, the more comprehensive, the more accurate your prompt (the better engineered your prompt), the better results you're going to get. So that's for coding. You can also take code and ask ChatGPT what's wrong with it: it's not working, can you explain where the errors are? Or you can take code and ask it to add comments to the code so you can better understand what's going on. So those are other aspects of ChatGPT, and maybe for next year we'll add some examples of that as well. Another application for ChatGPT is in the area of information extraction, or IE; it's shorter, easier to say.
So you're getting information about a topic and retrieving it from sources of text. It's trying to take free text, unstructured data (sentences, papers, journal articles), and put it into a structured format, a database. To do IE, you have to do named entity recognition, you know, to recognize nouns and recognize what types of nouns they are, and you then have to categorize that information. So why is information extraction important, for the field of science in general, not just bioinformatics? If you were around in the 1660s, you only had to read two scientific journals. Right now, there are about 60,000 scientific journals, and within the total number of articles, there are about 500,000 medical articles per year. That means that if you were a very good reader, you could read all of them at one article per minute. If we include all scientific articles, there are about 1.5 million published. Currently PubMed captures a good portion of those: there are about 35 million abstracts from 20,000 journals, so it covers about one third of the scientific journal population. You know, just in the last roughly 40 to 50 years, 70% of all scientific articles have been published, so it is climbing, if not linearly, probably exponentially. So if you were someone who studied, say, breast cancer 25 years ago, to stay current you would have had to read 27 papers per day and scan at least 130 different journals. And given that this was 25 years ago, you could probably double these numbers: if you're studying breast cancer today, you'd have to scan 250 journals and read at least 50 papers per day. I don't know anyone who has the time to do that. But this is a question that a lot of us are asking: how do we keep up? One way of doing this is through information extraction, and people started thinking about this even back in the '60s, though of course computers weren't powerful enough to do it then. Some other programs started appearing in the '80s and '90s.
Some of them were done for Reuters, the news-gathering organization. More recently, a number of open source tools have appeared: things that you can download for free that can do named entity recognition and help with some aspects of information extraction. You can also do information extraction with ChatGPT. So here's some text, a couple of hundred words, about artificially intelligent robots and the role of Alan Turing in artificial intelligence. And you can ask it: here's the text, extract the data classes from the above as a simple JSON object; give me the century, the date, the scientist, and the paper. You know, you could do this yourself (you'd have to read it), but here's the answer: it found that this is about the 20th century, the scientist is Alan Turing, the date in this case is 1950, and the paper that he wrote was "Computing Machinery and Intelligence". So you're taking text and rewriting it into a format that could be used in a database; that is pure information extraction. And the reason this works is that a large language model like ChatGPT is able to recognize entities; it can also categorize things, it can extrapolate things, and it has context information. A lot of the open source tools that are around don't have large language models; they have, you know, vast collections of code, huge lookup tables, and other tricks. But all of that is sort of parameterized in ChatGPT and the other LLMs. So what if you want to do more sophisticated information extraction? One of these things is the area of sentiment analysis using in-context learning, or query engineering. So ICL, or in-context learning, is a prompting method that allows you to take an already existing LLM like ChatGPT and have it learn a new task without fine-tuning or retraining the model on hundreds of GPUs. So it learns from analogies or examples.
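The extraction step above can be sketched programmatically; this is a hypothetical illustration where `build_extraction_prompt` is a helper I'm inventing for the example, and the model reply is hard-coded to mirror the answer from the demo rather than coming from a real API call.

```python
import json

def build_extraction_prompt(text, fields):
    """Wrap source text with an instruction asking for a flat JSON object."""
    return (
        f"{text}\n\n"
        f"Extract the following data classes from the text above "
        f"as a simple JSON object: {', '.join(fields)}."
    )

def parse_reply(reply):
    """Parse the model's reply; raises ValueError if it isn't valid JSON."""
    return json.loads(reply)

prompt = build_extraction_prompt(
    "In 1950, Alan Turing published 'Computing Machinery and Intelligence'.",
    ["century", "date", "scientist", "paper"],
)

# A well-behaved model might reply with something like this (hard-coded here):
reply = ('{"century": "20th", "date": 1950, "scientist": "Alan Turing", '
         '"paper": "Computing Machinery and Intelligence"}')
record = parse_reply(reply)
print(record["scientist"])
```

Asking explicitly for JSON and then validating the reply with `json.loads` is a simple way to catch replies that drift away from the structured format you asked for.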
In this case, you can say: here's a sentence, "the acting was superb"; the sentiment is positive. Another sentence, "the special effects were terrible"; the sentiment is negative. Now, from those two shots, how well can it learn? And now you give it an example: "the characters are well developed". Is this a positive sentiment or a negative sentiment? Here's the structure for the query: you give it "Review:" followed by the statement, then "Sentiment:" followed by the label, and give it a few examples. And when you give the sentiment as a question mark, it comes back with "positive". That's ChatGPT-3.5. ChatGPT-4 goes quite a bit further: the characters are well developed; the positive example praises an aspect of the performance while the negative one criticizes it; the inferred sentiment for the new review is positive. So it explains its rationale. It's much more verbose than ChatGPT-3.5, but both of them are able to do sentiment analysis, which is one of the harder things to do. You can then go beyond simple sentiment analysis and apply this to information extraction by giving examples of what is to be extracted. This is something that Sagan developed, where he's used some query engineering, or in-context learning, to extract biomedical data for bioinformatics applications. So here's a sentence: the LCT gene provides instructions for making an enzyme called lactase. Here are some relations: the LCT gene encodes lactase; lactase is an enzyme. Here's another example: HSP70, encoded by three closely related paralogs, HSPA1A, HSPA1B, and HSPA1L, is a stress-induced protein. Here are some relations. So you're giving it a sort of ontology instruction. And then your third example is the query: what is OCT4, and how would you extract the information about it? From that, GPT-3.5 tells you that POU5F1 is a gene, that POU5F1 encodes OCT4, and that OCT4 is a protein. GPT-4 does a little bit more; it's again more verbose and explains things.
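The few-shot prompt structure described above (labelled examples followed by the new review with the sentiment left as a question mark) can be sketched as a small helper; this is an illustrative reconstruction of the format, not code from the lecture.

```python
def few_shot_prompt(examples, query):
    """Build an in-context-learning prompt: labelled examples, then a query
    whose label is left as '?' for the model to fill in."""
    lines = []
    for review, sentiment in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {sentiment}")
    lines.append(f"Review: {query}")
    lines.append("Sentiment: ?")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("The acting was superb.", "Positive"),
     ("The special effects were terrible.", "Negative")],
    "The characters are well developed.",
)
print(prompt)
```

The same pattern extends directly to the relation-extraction examples: replace the Review/Sentiment pairs with sentence/relations pairs, and the model fills in the relations for the final sentence.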
And it rationalizes what it did, and it basically gives the same result, but explains that POU5F1 is also known as OCT4. So you can apply it to something even more complicated, and this is where we're doing extraction on a more complicated sentence about 4-pyridoxic acid: what sort of chemical class it belongs to, what it looks like, what it's derivatized with, that it's a catabolic product of vitamin B6, and that it's excreted in the urine. And so by giving more examples with more complex outputs and extraction criteria, we can see that from that sentence we can extract chemical classifications, structure information, location in biofluids, chemical relationships, and so on. So this is, you know, a dense sentence or two with lots of information, and this was done with GPT-3; I guess this is the Davinci model, is that right, Sagan? Yes, this was the Davinci model that was fine-tuned. So this goes a little further than what you can do with query engineering alone, but it is an example of where you can potentially take it. So again, I encourage people, after I finish here, to maybe give things a few tries: what can I do with query engineering? You won't be able to do fine-tuning, but I think you've been given some examples of what you can do with ICL. Now, you might think information extraction is great: now I can extract everything I want from all of the literature. No, not quite so fast. GPT-3 can only take in about 1,500 words, ChatGPT-3.5 about 3,000 words, and ChatGPT-4 about 6,000 words. And that limits both your input and your output, so that's not an entire corpus of all known scientific literature. And because the queries have to be complicated and carefully constructed, that means you're not able to get things done quickly.
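Those word limits mean a long document has to be split into batches and fed in progressively. A rough sketch of that batching, approximating tokens by whitespace-separated words (real tokenizers count sub-word tokens, so actual limits are tighter):

```python
def chunk_words(text, max_words=3000):
    """Split text into batches of at most max_words words, so each batch
    fits within a model's context limit. Word count is only a rough proxy
    for token count."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# A 7,000-word paper at a 3,000-word limit becomes three batches.
batches = chunk_words("word " * 7000, max_words=3000)
```

Each batch would then be wrapped in the same extraction prompt and submitted separately, which is why processing a large corpus this way is slow and, at API prices, expensive.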
If you wanted to be able to process lots of things, you could submit repeated queries: you can have a paper broken down into, say, three batches of 4,000 tokens and feed them in progressively, and if you have hundreds of papers you could feed them in, sort of like feeding sheets through a paper feeder. So it's not as if you're going to feed in hundreds of millions of words in one query; you'd have to do this repeatedly. That's possible. However, if you do the calculation, and you wanted, say, to extract all of the PubMed abstracts, because there are about 10 billion words in PubMed abstracts, it would cost you about $30 million. Now, you don't have to look at all of PubMed; you can obviously select and sub-select certain parts as a corpus and feed in smaller amounts of data. So if you said, I only want to cover breast cancer from, you know, 2010 and later, and only on BRCA1 subjects, you might be able to winnow it down to a few thousand abstracts, which translates to maybe 500,000 words, which might only cost a few hundred dollars a year to cover. So that's it for ChatGPT. It's also kind of the end of the course. Before I wrap up, maybe I'll ask people if they have any comments or questions about ChatGPT, or if they tried anything and got some interesting results. Okay. I think we're pretty much at the end of the day here. In terms of final comments, we're going to have a quick review, but the whole point of this was to give you some flavor of the different tools. We wanted to give you some code so you can reuse it if you want to. Machine learning is usually presented in terms of toy problems, really trivial ones, and while the iris problem is simple, it's not trivial, and it is biologically related. We tried to give you some real bioinformatics examples, real bioinformatics problems. I think we've also shown you the structure of neural nets.
We've tried to show that coding everything from scratch is pretty hard, and at least now that you know how difficult it can be, you appreciate things like scikit-learn and Keras and TensorFlow. You still do need to know coding, although certainly with ChatGPT, a lot of that tedious coding might be something it can do for you. We talked about classification and some of the regression calculations. These are things that are being webified, and I showed you examples of the table analyst that does that now, so you don't necessarily have to code. But there are other classes of machine learning problems, from gene prediction to protein structure prediction, that are very specialized, and those are probably too specialized for general web servers. The last thing we learned about was the chatbots, and they'll certainly change the field of bioinformatics; they'll change coding, and they'll change how we gather information. I think at some level it's going to be good, but on many levels I think there's a general expectation that there are going to be a lot of job losses for certain categories of specialists. How that will fall out, we don't know for sure. But as you learn more about this, you'll probably start appreciating where you should be and where you shouldn't be in the field of computing. Okay, so I want to thank everyone for sticking around and for being part of this workshop, and thanks for all of your efforts and engagement, and really engaging questions as well.